﻿WEBVTT

00:00:07.961 --> 00:00:12.153
- Okay. Can everyone hear me?
Okay. Sorry for the delay.

00:00:12.153 --> 00:00:24.081
I had a bit of technical difficulty. Today was the first time I was trying to use my new Touch Bar MacBook Pro for presenting,
and none of the adapters were working. So, I had to switch laptops at the last minute. So, thanks, and sorry about that.

00:00:25.353 --> 00:00:30.420
So, today is lecture 10. We're talking
about recurrent neural networks.

00:00:30.420 --> 00:00:33.003
So, as usual, a
couple of administrative notes.

00:00:33.003 --> 00:00:37.353
So, we're working hard
on assignment one grading.

00:00:37.353 --> 00:00:46.251
Those grades will probably be out sometime later today. Hopefully,
they can get out before the A2 deadline. That's what I'm hoping for.

00:00:46.251 --> 00:00:50.361
On a related note, Assignment
two is due today at 11:59 p.m.

00:00:50.361 --> 00:00:56.633
So, who's done with that already?
About half of you guys.

00:00:56.633 --> 00:01:03.811
So, remember, I did warn you when the assignment went out that
it was quite long, and to start early. So, you were warned about that.

00:01:03.811 --> 00:01:06.561
But, hopefully, you guys
have some late days left.

00:01:06.561 --> 00:01:10.531
Also, as another reminder,
the midterm will be in class on Tuesday.

00:01:10.531 --> 00:01:15.881
If you kind of look around the lecture hall, there are not enough
seats in this room to seat all the enrolled students in the class.

00:01:15.881 --> 00:01:20.062
So, we'll actually be having the midterm in
several other lecture halls across campus.

00:01:20.062 --> 00:01:26.099
And we'll be sending out some more details on
exactly where to go in the next couple of days.

00:01:26.099 --> 00:01:28.179
So, another bit of
an announcement.

00:01:28.179 --> 00:01:34.950
We've been working on this sort of fun bit of extra credit thing for you to play with
that we're calling the training game. This is this cool browser-based experience,

00:01:34.950 --> 00:01:39.927
where you can go in and interactively train neural
networks and tweak the hyperparameters during training.

00:01:39.927 --> 00:01:47.646
And this should be a really cool interactive way for you to practice some of these
hyperparameter tuning skills that we've been talking about the last couple of lectures.

00:01:47.646 --> 00:01:53.190
So this is not required, but this, I think, will be a really
useful experience to gain a little bit more intuition

00:01:53.190 --> 00:01:57.481
into how some of these hyperparameters work
for different types of data sets in practice.

00:01:57.481 --> 00:02:05.790
So we're still working on getting all the bugs worked out of this setup, and we'll probably
send out some more instructions on exactly how this will work in the next couple of days.

00:02:05.790 --> 00:02:11.008
But again, not required. But please do check it out. I think it'll
be really fun and a really cool thing for you to play with.

00:02:11.008 --> 00:02:17.204
And we'll give you a bit of extra credit if you
end up working with this and doing a couple of runs with it.

00:02:18.208 --> 00:02:23.458
So, we'll again send out some more details about
this soon once we get all the bugs worked out.

00:02:24.720 --> 00:02:28.139
As a reminder, last time we were
talking about CNN Architectures.

00:02:28.139 --> 00:02:35.006
We kind of walked through the timeline of some of the various winners of
the ImageNet classification challenge, and kind of the breakthrough result,

00:02:35.006 --> 00:02:41.331
as we saw, was the AlexNet architecture in 2012, which was a
nine-layer convolutional network. It did amazingly well,

00:02:41.331 --> 00:02:48.081
and it sort of kick started this whole deep learning revolution in computer
vision, and kind of brought a lot of these models into the mainstream.

00:02:48.081 --> 00:02:56.699
Then we skipped ahead a couple of years, and saw that in the 2014 ImageNet challenge, we
had these two really interesting models, VGG and GoogLeNet, which were much deeper.

00:02:56.699 --> 00:03:02.930
So for VGG, they had a 16- and a 19-layer model,
and GoogLeNet was, I believe, a 22-layer model.

00:03:02.930 --> 00:03:11.230
Although one thing that is kind of interesting about these models is that the
2014 ImageNet challenge was right before batch normalization was invented.

00:03:11.230 --> 00:03:18.761
So at this time, before the invention of batch normalization, training these
relatively deep models of roughly twenty layers was very challenging.

00:03:18.761 --> 00:03:24.869
So, in fact, both of these two models had to resort to a little
bit of hackery in order to get their deep models to converge.

00:03:24.869 --> 00:03:28.579
So for VGG, they had the
16 and 19 layer models,

00:03:28.579 --> 00:03:34.107
but actually they first trained an 11 layer model,
because that was what they could get to converge.

00:03:34.107 --> 00:03:40.059
And then they added some extra random layers in the middle and continued
training, to actually train the 16- and 19-layer models.

00:03:40.059 --> 00:03:46.539
So, managing this training process was very challenging
in 2014 before the invention of batch normalization.

00:03:46.539 --> 00:03:52.539
Similarly, for GoogLeNet, we saw that GoogLeNet has these auxiliary
classifiers that were stuck into lower layers of the network.

00:03:52.539 --> 00:03:56.539
And these were not really needed
to get good classification performance.

00:03:56.539 --> 00:04:03.430
This was just sort of a way to cause extra gradient to be
injected directly into the lower layers of the network.

00:04:03.430 --> 00:04:10.411
And this, again, was before the invention of batch normalization,
and now, once you have these networks with batch normalization,

00:04:10.411 --> 00:04:17.321
then you no longer need these slightly ugly hacks
in order to get these deeper models to converge.

00:04:17.321 --> 00:04:24.350
Then, in the 2015 ImageNet challenge, we also saw this
really cool model called ResNet, these residual networks

00:04:24.350 --> 00:04:28.310
that have these shortcut connections,
these little residual blocks,

00:04:28.310 --> 00:04:39.110
where we take our input, pass it through the residual block, and then
add our input to the block back to the output from these convolutional layers.

00:04:39.110 --> 00:04:43.308
This is kind of a funny architecture, but
it actually has two really nice properties.

00:04:43.308 --> 00:04:49.531
One is that if we just set all the weights in this residual
block to zero, then this block is computing the identity.

00:04:49.531 --> 00:04:55.681
So in some way, it's relatively easy for this model
to learn not to use the layers that it doesn't need.
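As an aside, this identity property is easy to check with a minimal numpy sketch (my own toy illustration with fully connected layers standing in for the convolutions, not code from the lecture): a residual block computes x plus F(x), so zero weights make the block exactly the identity.

```python
import numpy as np

# Toy residual block (hypothetical, fully connected instead of
# convolutional): output = x + F(x).
def residual_block(x, W1, W2):
    out = np.maximum(0.0, x @ W1)  # first layer + ReLU
    out = out @ W2                 # second layer
    return x + out                 # shortcut: add the input back

x = np.array([1.0, -2.0, 3.0])
W_zero = np.zeros((3, 3))
y = residual_block(x, W_zero, W_zero)
# With all weights zero, F(x) = 0 and the block outputs x unchanged.
```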

00:04:55.681 --> 00:05:02.171
In addition, it kind of gives this nice interpretation of L2
regularization in the context of these neural networks,

00:05:02.171 --> 00:05:08.321
because once you put L2 regularization, remember, on the weights
of your network, that's going to drive all the parameters towards zero.

00:05:08.321 --> 00:05:12.739
And in a standard convolutional architecture, driving
everything towards zero maybe doesn't make sense.

00:05:12.739 --> 00:05:20.510
But in the context of a residual network, if you drive all the parameters towards
zero, that's kind of encouraging the model to not use layers that it doesn't need,

00:05:20.510 --> 00:05:26.310
because it will just drive those residual blocks towards
the identity when they're not needed for classification.

00:05:26.310 --> 00:05:31.371
The other really useful property of these residual networks
has to do with the gradient flow in the backward pass.

00:05:31.371 --> 00:05:34.361
If you remember what happens at these
addition gates in the backward pass,

00:05:34.361 --> 00:05:39.881
when upstream gradient is coming in through an addition gate,
then it will split and fork along these two different paths.

00:05:39.881 --> 00:05:46.361
So then, when upstream gradient comes in, it'll
take one path through these convolutional blocks,

00:05:46.361 --> 00:05:50.811
but it will also pass the gradient directly
through this residual connection.
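A tiny sketch of that addition-gate behavior (my own illustration): in the backward pass, an add node copies the upstream gradient unchanged to both of its inputs, which is exactly why the shortcut branch carries gradient directly past the convolutional branch.

```python
import numpy as np

# Backward pass through an addition gate y = a + b: the upstream
# gradient dL/dy is passed unchanged to both inputs.
upstream = np.array([0.5, -1.0, 2.0])  # dL/dy arriving at the add gate
grad_conv_branch = upstream.copy()     # flows into the conv layers
grad_shortcut = upstream.copy()        # flows straight to earlier layers
```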

00:05:50.811 --> 00:05:59.150
So then, when you imagine stacking many of these residual blocks on top
of each other, our network ends up with potentially hundreds of layers.

00:05:59.150 --> 00:06:05.561
Then, these residual connections give a sort of gradient super
highway for gradients to flow backward through the entire network.

00:06:05.561 --> 00:06:09.630
And this allows it to train much easier
and much faster.

00:06:09.630 --> 00:06:15.738
And actually allows these things to converge reasonably well,
even when the model is potentially hundreds of layers deep.

00:06:15.738 --> 00:06:21.550
And this idea of managing gradient flow in your models is
actually super important everywhere in machine learning.

00:06:21.550 --> 00:06:28.564
And super prevalent in recurrent networks as well. So we'll definitely
revisit this idea of gradient flow later in today's lecture.

00:06:31.148 --> 00:06:38.068
So then, we kind of also saw a couple other more exotic, more recent
CNN architectures last time, including DenseNet and FractalNet,

00:06:38.068 --> 00:06:43.070
and once you think about these architectures in terms
of gradient flow, they make a little bit more sense.

00:06:43.070 --> 00:06:48.619
These things like DenseNet and FractalNet are adding these
additional shortcut or identity connections inside the model.

00:06:48.619 --> 00:07:00.571
And if you think about what happens in the backwards pass for these models, these additional funny topologies are basically providing
direct paths for gradients to flow from the loss at the end of the network more easily into all the different layers of the network.

00:07:00.571 --> 00:07:09.760
So I think that, again, this idea of managing gradient flow properly in your CNN
architectures is something that we've really seen a lot more in the last couple of years.

00:07:09.760 --> 00:07:15.221
And will probably see more moving forward
as more exotic architectures are invented.

00:07:16.257 --> 00:07:24.331
We also saw this kind of nice plot comparing the performance, the number of
flops, the number of parameters, and the run time of these various models.

00:07:24.331 --> 00:07:27.971
And there's some interesting characteristics
that you can dive in and see from this plot.

00:07:27.971 --> 00:07:32.801
One idea is that VGG and AlexNet
have a huge number of parameters,

00:07:32.801 --> 00:07:37.119
and these parameters actually come almost entirely
from the fully connected layers of the models.

00:07:37.119 --> 00:07:39.959
So AlexNet has something like
roughly 62 million parameters,

00:07:39.959 --> 00:07:47.771
and if you look at the final fully connected layer in AlexNet,
it's going from an activation volume of six by six by 256

00:07:47.771 --> 00:07:51.190
into this fully connected vector of 4096.

00:07:51.190 --> 00:07:56.851
So if you imagine what the weight matrix needs to look
like at that layer, the weight matrix is gigantic.

00:07:56.851 --> 00:08:01.921
Its number of entries is
six times six times 256 times 4096.

00:08:01.921 --> 00:08:06.370
And if you multiply that out, you see that
that single layer has 38 million parameters.

00:08:06.370 --> 00:08:11.859
So more than half of the parameters of the entire AlexNet
model are just sitting in that last fully connected layer.
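The arithmetic here is easy to verify (a quick check of the numbers quoted above, using 4096 for the fully connected layer width):

```python
# AlexNet's first fully connected layer maps a 6x6x256 activation
# volume to 4096 hidden units, so its weight matrix has this many entries:
fc_weights = 6 * 6 * 256 * 4096
print(fc_weights)                 # 37748736, i.e. about 38 million

# Against the rough total of 62 million parameters quoted in lecture,
# that single layer alone is more than half of the model.
total = 62_000_000
print(fc_weights / total)         # about 0.61
```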

00:08:11.859 --> 00:08:24.241
And if you add up all the parameters in just the fully connected layers of AlexNet, including these other fully connected
layers, you see something like 59 of the 62 million parameters in AlexNet are sitting in these fully connected layers.

00:08:24.241 --> 00:08:31.110
So then when we move to other architectures, like GoogLeNet and ResNet,
they do away with a lot of these large fully connected layers

00:08:31.110 --> 00:08:33.698
in favor of global average pooling
at the end of the network.

00:08:33.698 --> 00:08:40.935
And this allows these nicer architectures
to really cut down the parameter count.

00:08:44.463 --> 00:08:49.604
So that was kind of our brief recap of the CNN
architectures that we saw last lecture, and then today,

00:08:49.604 --> 00:08:56.321
we're going to move to one of my favorite topics
to talk about, which is recurrent neural networks.

00:08:56.321 --> 00:09:03.222
So, so far in this class, we've seen what I like to think of as kind of a
vanilla feed-forward network; all of our network architectures have this flavor,

00:09:03.222 --> 00:09:08.593
where we receive some input and that input is
a fixed size object, like an image or vector.

00:09:08.593 --> 00:09:13.850
That input is fed through some set of
hidden layers and produces a single output,

00:09:13.850 --> 00:09:18.876
like a set of classification
scores over a set of categories.

00:09:20.071 --> 00:09:25.942
But in some contexts in machine learning, we want to have more
flexibility in the types of data that our models can process.

00:09:25.942 --> 00:09:35.313
So once we move to this idea of recurrent neural networks, we have a lot more opportunities
to play around with the types of input and output data that our networks can handle.

00:09:35.313 --> 00:09:41.009
So once we have recurrent neural networks, we
can do what we call these one to many models.

00:09:41.009 --> 00:09:48.721
where maybe our input is some object of fixed size, like an image,
but now our output is a sequence of variable length, such as a caption,

00:09:48.721 --> 00:09:54.081
Where different captions might have different numbers
of words, so our output needs to be variable in length.

00:09:54.081 --> 00:09:56.491
We also might have many to one models,

00:09:56.491 --> 00:10:01.001
where our input could be variably sized. This
might be something like a piece of text,

00:10:01.001 --> 00:10:06.161
and we want to say what is the sentiment of that text,
whether it's positive or negative in sentiment.

00:10:06.161 --> 00:10:12.512
Or in a computer vision context, you might imagine taking as input
a video, and that video might have a variable number of frames.

00:10:12.512 --> 00:10:16.401
And now we want to read this entire video
of potentially variable length.

00:10:16.401 --> 00:10:22.721
And then at the end, make a classification decision about maybe
what kind of activity or action is going on in that video.

00:10:22.721 --> 00:10:29.931
We might also have problems where we want
both the input and the output to be variable in length.

00:10:29.931 --> 00:10:37.302
We might see something like this in machine translation, where our input
is some, maybe, sentence in English, which could have a variable length,

00:10:37.302 --> 00:10:41.633
and our output is maybe some sentence in French,
which also could have a variable length.

00:10:41.633 --> 00:10:46.801
And crucially, the length of the English sentence might
be different from the length of the French sentence.

00:10:46.801 --> 00:10:53.931
So we need some models that have the capacity to accept both
variable length sequences on the input and on the output.

00:10:53.931 --> 00:11:04.771
Finally, we might also consider problems where our input is variable in length, like a video sequence
with a variable number of frames. And now we want to make a decision for each element of that input sequence.

00:11:04.771 --> 00:11:11.891
So in the context of videos, that might be making some
classification decision along every frame of the video.

00:11:11.891 --> 00:11:17.401
And recurrent neural networks are this kind of general
paradigm for handling variable sized sequence data

00:11:17.401 --> 00:11:23.469
that allow us to pretty naturally capture all of
these different types of setups in our models.

00:11:24.349 --> 00:11:33.752
So recurrent neural networks are actually important: even for some problems that have a
fixed size input and a fixed size output, recurrent neural networks can still be pretty useful.

00:11:33.752 --> 00:11:38.793
So in this example, we might want to do, for
example, sequential processing of our input.

00:11:38.793 --> 00:11:46.227
So here, we're receiving a fixed size input like an image, and we want to make a
classification decision about, like, what number is being shown in this image?

00:11:46.227 --> 00:11:50.393
But now, rather than just doing a single feed
forward pass and making the decision all at once,

00:11:50.393 --> 00:11:55.553
this network is actually looking around the image and
taking various glimpses of different parts of the image.

00:11:55.553 --> 00:12:01.742
And then after making some series of glimpses, then it makes
its final decision as to what kind of number is present.

00:12:01.742 --> 00:12:17.473
So here, even though our input was an image, and our output was a classification decision, even in this context,
this idea of being able to handle variable length processing with recurrent neural networks can lead to some really interesting types of models.

00:12:17.473 --> 00:12:23.923
There's a really cool paper that I like that applied
this same type of idea to generating new images.

00:12:23.923 --> 00:12:29.723
Where now, we want the model to synthesize brand new images
that look kind of like the images it saw in training,

00:12:29.723 --> 00:12:36.254
and we can use a recurrent neural network architecture to actually
paint these output images sort of one piece at a time in the output.

00:12:36.254 --> 00:12:46.380
You can see that, even though our output is this fixed size image, we can have these models
that are working over time to compute parts of the output one at a time sequentially.

00:12:46.380 --> 00:12:51.662
And we can use recurrent neural networks
for that type of setup as well.

00:12:51.662 --> 00:12:58.785
So after this sort of cool pitch about all these cool things that
RNNs can do, you might wonder, like what exactly are these things?

00:12:58.785 --> 00:13:04.163
So in general, a recurrent neural network
has this little recurrent core cell,

00:13:04.163 --> 00:13:11.382
and it will take some input x, feed that input into
the RNN, and that RNN has some internal hidden state,

00:13:11.382 --> 00:13:17.641
and that internal hidden state will be updated
every time that the RNN reads a new input.

00:13:17.641 --> 00:13:23.980
And that internal hidden state will be then fed
back to the model the next time it reads an input.

00:13:23.980 --> 00:13:28.822
And frequently, we will want our RNNs to
also produce some output at every time step,

00:13:28.822 --> 00:13:31.043
so we'll have this pattern
where it will read an input,

00:13:31.043 --> 00:13:34.469
update its hidden state,
and then produce an output.

00:13:35.814 --> 00:13:40.961
So then the question is what is the functional form
of this recurrence relation that we're computing?

00:13:40.961 --> 00:13:46.443
So inside this little green RNN block, we're computing
some recurrence relation, with a function f.

00:13:46.443 --> 00:13:49.094
So this function f will
depend on some weights, w.

00:13:49.094 --> 00:13:55.374
It will accept the previous hidden state, h t - 1,
as well as the input at the current state, x t,

00:13:55.374 --> 00:14:01.420
and this will output the next hidden state, or
the updated hidden state, that we call h t.

00:14:01.420 --> 00:14:11.552
And now, then as we read the next input, this hidden state, this new hidden state, h t,
will then just be passed into the same function as we read the next input, x t plus one.

00:14:11.552 --> 00:14:21.797
And now, if we wanted to produce some output at every time step of this network, we might
attach some additional fully connected layers that read in this h t at every time step.

00:14:21.797 --> 00:14:27.327
And make that decision based
on the hidden state at every time step.

00:14:27.327 --> 00:14:35.662
And one thing to note is that we use the same function, f w, and
the same weights, w, at every time step of the computation.

00:14:36.921 --> 00:14:43.434
So then, kind of the simplest functional form that you can imagine
is what we call this vanilla recurrent neural network.

00:14:43.434 --> 00:14:46.866
So here, we have this same functional form
from the previous slide,

00:14:46.866 --> 00:14:52.483
where we're taking in our previous hidden state and our
current input and we need to produce the next hidden state.

00:14:52.483 --> 00:15:00.124
And the kind of simplest thing you might imagine is that we have
some weight matrix, w x h, that we multiply against the input, x t,

00:15:00.124 --> 00:15:05.615
as well as another weight matrix, w h h, that
we multiply against the previous hidden state.

00:15:05.615 --> 00:15:09.327
So we make these two multiplications
against our two states, add them together,

00:15:09.327 --> 00:15:13.514
and squash them through a tanh, so we get
some kind of non-linearity in the system.
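This single step can be written out as a short numpy sketch (my own illustration; the weight names W_xh and W_hh follow the slide's notation, and the sizes are made up): h_t = tanh(W_hh · h_{t-1} + W_xh · x_t).

```python
import numpy as np

# One step of the vanilla RNN recurrence:
#   h_t = tanh(W_hh @ h_{t-1} + W_xh @ x_t)
def rnn_step(x_t, h_prev, W_xh, W_hh):
    return np.tanh(W_hh @ h_prev + W_xh @ x_t)

rng = np.random.default_rng(0)
H, D = 4, 3                        # hidden size, input size (made up)
W_xh = rng.standard_normal((H, D))
W_hh = rng.standard_normal((H, H))
h1 = rnn_step(rng.standard_normal(D), np.zeros(H), W_xh, W_hh)
# tanh squashes every entry of the new hidden state into (-1, 1)
```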

00:15:13.514 --> 00:15:17.312
You might be wondering why we use a tanh here
and not some other type of non-linearity.

00:15:17.312 --> 00:15:20.594
After all the negative things we've said
about tanhs in previous lectures,

00:15:20.594 --> 00:15:26.507
and I think we'll return a little bit to that later on
when we talk about more advanced architectures, like the LSTM.

00:15:27.346 --> 00:15:33.394
So then, in addition, in this architecture,
if we wanted to produce some y t at every time step,

00:15:33.394 --> 00:15:40.375
you might have another weight matrix that accepts
this hidden state and then transforms it to some y

00:15:40.375 --> 00:15:44.826
to produce maybe some class score
predictions at every time step.

00:15:44.826 --> 00:15:51.487
And when I think about recurrent neural networks, you can
kind of think of recurrent neural networks in two ways.

00:15:51.487 --> 00:15:57.095
One is this concept of having a hidden state
that feeds back into itself, recurrently.

00:15:57.095 --> 00:16:05.914
But I find that picture a little bit confusing. And sometimes, I find it clearer
to think about unrolling this computational graph for multiple time steps.

00:16:05.914 --> 00:16:11.786
And this makes the data flow of the hidden states and the inputs
and the outputs and the weights maybe a little bit more clear.

00:16:11.786 --> 00:16:15.494
So then at the first time step, we'll
have some initial hidden state h zero.

00:16:15.494 --> 00:16:22.415
This is usually initialized to zeros in most
contexts, and then we'll have some input, x t.

00:16:22.415 --> 00:16:28.324
This initial hidden state, h zero, and our current
input, x t, will go into our f w function.

00:16:28.324 --> 00:16:36.154
This will produce our next hidden state, h one. And then,
we'll repeat this process when we receive the next input.

00:16:36.154 --> 00:16:42.847
So now our current h one and our x one, will go into
that same f w, to produce our next output, h two.

00:16:42.847 --> 00:16:50.866
And this process will repeat over and over again, as we
consume all of the input, x ts, in our sequence of inputs.
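The unrolled forward pass described above can be sketched as a simple loop (my own illustration, sizes made up): start from h0 = zeros and apply the same weights at every time step as the inputs are consumed.

```python
import numpy as np

# Unrolling the recurrence: the same W_xh, W_hh are applied at every
# time step as we consume the sequence of inputs.
def rnn_forward(xs, W_xh, W_hh):
    h = np.zeros(W_hh.shape[0])            # h0, initialized to zeros
    hidden_states = []
    for x_t in xs:                          # one step per input x_t
        h = np.tanh(W_hh @ h + W_xh @ x_t)  # same weights every step
        hidden_states.append(h)
    return hidden_states

rng = np.random.default_rng(1)
H, D, T = 4, 3, 5                           # hidden, input, sequence length
hs = rnn_forward(rng.standard_normal((T, D)),
                 rng.standard_normal((H, D)),
                 rng.standard_normal((H, H)))
```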

00:16:50.866 --> 00:16:58.036
And now, one thing to note, is that we can actually make this even
more explicit and write the w matrix in our computational graph.

00:16:58.036 --> 00:17:03.415
And here you can see that we're re-using the same
w matrix at every time step of the computation.

00:17:03.415 --> 00:17:11.006
So now every time that we have this little f w block, it's receiving a
unique h and a unique x, but all of these blocks are taking the same w.

00:17:11.007 --> 00:17:20.786
And if you remember, we talked about how gradient flows in back propagation
when you re-use the same node multiple times in a computational graph,

00:17:20.786 --> 00:17:28.218
then remember, during the backward pass, you end up summing the
gradients into the w matrix when you're computing d loss d w.

00:17:28.218 --> 00:17:32.526
So, if you kind of think about
the back propagation for this model,

00:17:32.526 --> 00:17:42.503
then you'll have a separate gradient for w flowing from each of those time steps, and then
the final gradient for w will be the sum of all of those individual per-time-step gradients.
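A tiny scalar example (my own, not from the lecture) makes that summing concrete: with h_t = w · h_{t-1} + x_t and loss L = h_T, the per-time-step contributions to dL/dw add up to the true gradient, which we can confirm numerically.

```python
# Scalar "RNN" h_t = w * h_{t-1} + x_t with loss L = h_T: a toy example
# showing that per-time-step gradients for the shared weight are summed.
w = 0.5
xs = [1.0, 2.0, 3.0]
hs = [0.0]                                  # h0
for x in xs:
    hs.append(w * hs[-1] + x)
T = len(xs)

# Per-step contribution at step t: (dL/dh_t) * (local dh_t/dw)
#   = w**(T - t) * h_{t-1}; the total gradient is their sum.
dw = sum(w ** (T - t) * hs[t - 1] for t in range(1, T + 1))

# Numerical check of dL/dw by central differences.
def loss(wv):
    h = 0.0
    for x in xs:
        h = wv * h + x
    return h
eps = 1e-6
dw_num = (loss(w + eps) - loss(w - eps)) / (2 * eps)
```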

00:17:43.615 --> 00:17:47.727
We can also write this y t explicitly
in this computational graph.

00:17:47.727 --> 00:17:54.858
So then, this output, h t, at every time step might feed into
some other little neural network that can produce a y t,

00:17:54.858 --> 00:17:59.087
which might be some class scores, or
something like that, at every time step.

00:17:59.087 --> 00:18:00.738
We can also make the loss more explicit.

00:18:00.738 --> 00:18:14.068
So in many cases, you might imagine that you have some ground truth label at every time step
of your sequence, and then you'll compute some individual loss at every time step on these outputs, the y t's.

00:18:14.068 --> 00:18:22.497
And this loss will frequently be something like softmax loss, in the case
where you have, maybe, a ground truth label at every time step of the sequence.

00:18:22.497 --> 00:18:27.887
And now the final loss for this entire training
step will be the sum of these individual losses.

00:18:27.887 --> 00:18:34.196
So now, we have a scalar loss at every time step, and we just sum
them up to get our final scalar loss at the top of the network.
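This per-time-step loss summing looks like the following sketch (my own illustration; the scores and targets are made up purely to show the shape of the computation):

```python
import numpy as np

# Softmax (cross-entropy) loss at each time step, summed into a single
# scalar loss for the whole sequence.
def softmax_loss(scores, target):
    shifted = scores - scores.max()        # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[target]

scores_per_step = [np.array([2.0, 1.0, 0.1]),
                   np.array([0.5, 2.5, 0.3])]
targets = [0, 1]                           # ground truth label per step
total_loss = sum(softmax_loss(s, y)
                 for s, y in zip(scores_per_step, targets))
```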

00:18:34.196 --> 00:18:42.098
And now, if you think about back propagation through this thing again, in order
to train the model, we need to compute the gradient of the loss with respect to w.

00:18:42.098 --> 00:18:46.178
So, we'll have gradients flowing from that
final loss into each of these time steps.

00:18:46.178 --> 00:18:49.840
And then each of those time steps will
compute a local gradient on the weights, w,

00:18:49.840 --> 00:18:54.343
which will all then be summed to give us
our final gradient for the weights, w.

00:18:55.597 --> 00:19:01.188
Now, if we have, sort of, this many to one situation, where
maybe we want to do something like sentiment analysis,

00:19:01.188 --> 00:19:05.799
then we would typically make that decision based
on the final hidden state of this network.

00:19:05.799 --> 00:19:11.868
Because this final hidden state kind of summarizes
all of the context from the entire sequence.

00:19:11.868 --> 00:19:14.788
Also, if we have a kind of
a one to many situation,

00:19:14.788 --> 00:19:19.319
where we want to receive a fixed size input
and then produce a variably sized output,

00:19:19.319 --> 00:19:26.050
Then you'll commonly use that fixed size input to
initialize, somehow, the initial hidden state of the model,

00:19:26.050 --> 00:19:30.079
and now the recurrent network will tick
for each cell in the output.

00:19:30.079 --> 00:19:36.915
And now, as you produce your variably sized output,
you'll unroll the graph for each element in the output.

00:19:38.490 --> 00:19:44.308
So when we talk about the sequence to sequence models,
where you might do something like machine translation,

00:19:44.308 --> 00:19:47.648
where you take a variably sized input
and produce a variably sized output.

00:19:47.648 --> 00:19:52.398
You can think of this as a combination
of the many to one, plus a one to many.

00:19:52.398 --> 00:19:56.900
So, we'll kind of proceed in two stages,
what we call an encoder and a decoder.

00:19:56.900 --> 00:20:02.159
So for the encoder, we'll receive the variably
sized input, which might be your sentence in English,

00:20:02.159 --> 00:20:08.110
and then summarize that entire sentence using
the final hidden state of the encoder network.

00:20:08.110 --> 00:20:15.769
And now we're in this many to one situation where we've summarized
this entire variably sized input in this single vector,

00:20:15.769 --> 00:20:23.111
and now, we have a second decoder network, which is a one to many situation,
which will input that single vector summarizing the input sentence

00:20:23.111 --> 00:20:28.969
and now produce this variably sized output, which
might be your sentence in another language.

00:20:28.969 --> 00:20:34.609
And now in this variably sized output, we might make some
predictions at every time step, maybe about what word to use.

00:20:34.609 --> 00:20:38.199
And you can imagine kind of training this entire
thing by unrolling this computational graph

00:20:38.199 --> 00:20:44.692
summing the losses over the output sequence and
just performing back propagation, as usual.
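The encoder/decoder data flow described above can be sketched in a few lines (my own illustration: all weights here are random placeholders, and the names and sizes are made up, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(7)
H, D, V = 8, 5, 10          # hidden size, input size, output vocab size
enc_Wxh = rng.standard_normal((H, D))
enc_Whh = rng.standard_normal((H, H))
dec_Whh = rng.standard_normal((H, H))
dec_Why = rng.standard_normal((V, H))

# Encoder (many to one): keep only the final hidden state as a summary
# of the whole variable-length input sequence.
h = np.zeros(H)
for x_t in rng.standard_normal((6, D)):    # a 6-step input sequence
    h = np.tanh(enc_Whh @ h + enc_Wxh @ x_t)

# Decoder (one to many): start from the summary vector and unroll,
# emitting scores over the output vocabulary at each step.
outputs = []
for _ in range(4):                         # a 4-step output sequence
    h = np.tanh(dec_Whh @ h)
    outputs.append(dec_Why @ h)
```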

00:20:44.692 --> 00:20:50.940
So as a bit of a concrete example, one thing that we frequently use
recurrent neural networks for, is this problem called language modeling.

00:20:50.940 --> 00:21:00.908
So in the language modeling problem, we want to have our
network, sort of, understand how to produce natural language.

00:21:00.908 --> 00:21:06.601
So this might happen at the character level,
where our model will produce characters one at a time.

00:21:06.601 --> 00:21:10.769
This might also happen at the word level where
our model will produce words one at a time.

00:21:10.769 --> 00:21:14.740
But in a very simple example, you can
imagine this character level language model

00:21:14.740 --> 00:21:22.780
where we want, where the network will read some sequence of characters and then
it needs to predict, what will the next character be in this stream of text?

00:21:22.780 --> 00:21:33.884
So in this example, we have this very small vocabulary of four letters, h, e, l, and
o, and we have this example training sequence of the word hello, h, e, l, l, o.

00:21:33.884 --> 00:21:49.689
So during training, when we're training this language model, we
will feed in the characters of this training sequence;

00:21:49.689 --> 00:21:53.980
these will be the x ts that we feed in as
the inputs to our recurrent neural network.

00:21:53.980 --> 00:22:01.039
And then, each of these inputs is a letter, and we need
to figure out a way to represent letters in our network.

00:22:01.039 --> 00:22:07.460
So what we'll typically do is figure out what is our total
vocabulary. In this case, our vocabulary has four elements.

00:22:07.460 --> 00:22:12.589
And each letter will be represented by a
vector that has zeros in every slot but one,

00:22:12.589 --> 00:22:16.628
and a one for the slot in the vocabulary
corresponding to that letter.

00:22:16.628 --> 00:22:22.324
In this little example, since our vocab has the
four letters, h, e, l, o, then our input sequence,

00:22:22.324 --> 00:22:28.684
the h is represented by a four element vector with a one
in the first slot and zeros in the other three slots.

00:22:28.684 --> 00:22:33.139
And we use the same sort of pattern to represent
all the different letters in the input sequence.
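This one-hot encoding can be sketched in a few lines; the four-letter vocabulary matches the lecture's toy example, but the helper names here are just for illustration:

```python
import numpy as np

# Toy vocabulary from the "hello" example.
vocab = ['h', 'e', 'l', 'o']
char_to_ix = {ch: i for i, ch in enumerate(vocab)}

def one_hot(ch):
    # Vector of zeros with a single one in the slot for this character.
    v = np.zeros(len(vocab))
    v[char_to_ix[ch]] = 1.0
    return v

print(one_hot('h'))  # [1. 0. 0. 0.]
```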

00:22:34.914 --> 00:22:41.874
Now, during this forward pass of what this network is doing,
at the first time step, it will receive the input letter h.

00:22:41.874 --> 00:22:48.594
That will go into the first RNN cell,
and then we'll produce this output, y_t,

00:22:48.594 --> 00:22:56.024
which is the network making a prediction, for each letter in the
vocabulary, of which letter it thinks is most likely to come next.

00:22:56.024 --> 00:23:01.405
In this example, the correct output letter was
e because our training sequence was hello,

00:23:01.405 --> 00:23:06.861
but the model is actually predicting, I think,
o as the most likely letter.

00:23:07.850 --> 00:23:13.889
So in this case, this prediction was wrong, and we would use the
softmax loss to quantify our unhappiness with these predictions.

00:23:13.889 --> 00:23:19.741
The next time step, we would feed in the second letter in
the training sequence, e, and this process will repeat.

00:23:19.741 --> 00:23:27.271
We'll now represent e as a vector, use that input vector together
with the previous hidden state to produce a new hidden state,

00:23:27.271 --> 00:23:31.912
and now use the second hidden state to, again, make
predictions over every letter in the vocabulary.

00:23:31.912 --> 00:23:36.810
In this case, because our training sequence was hello,
after the letter e, we want our model to predict l.

00:23:36.810 --> 00:23:41.954
In this case, our model may have very low predictions
for the letter l, so we would incur high loss.

00:23:44.244 --> 00:23:50.343
And you kind of repeat this process over and over, and
if you train this model with many different sequences,

00:23:50.343 --> 00:23:58.596
then eventually it should learn how to predict the next character in a sequence
based on the context of all the previous characters that it's seen before.
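The training-time forward pass described above can be sketched as follows; this is a minimal, untrained toy (the hidden size, initialization scale, and function names are illustrative, not the exact min-char-rnn code):

```python
import numpy as np

# Toy sizes matching the four-character example; initialization is illustrative.
np.random.seed(0)
V, H = 4, 8                        # vocabulary size, hidden size
Wxh = np.random.randn(H, V) * 0.01
Whh = np.random.randn(H, H) * 0.01
Why = np.random.randn(V, H) * 0.01

def forward_loss(inputs, targets):
    """inputs/targets are lists of vocabulary indices; returns the total
    softmax (cross-entropy) loss summed over the time steps."""
    h = np.zeros(H)
    loss = 0.0
    for x_ix, y_ix in zip(inputs, targets):
        x = np.zeros(V); x[x_ix] = 1.0         # one-hot input vector
        h = np.tanh(Wxh @ x + Whh @ h)         # recurrence
        y = Why @ h                            # scores over the vocabulary
        p = np.exp(y - y.max()); p /= p.sum()  # softmax
        loss += -np.log(p[y_ix])               # unhappiness at this step
    return loss

# "hello": inputs h,e,l,l should predict targets e,l,l,o (indices 0..3).
loss = forward_loss([0, 1, 2, 2], [1, 2, 2, 3])
```

With untrained weights the predictions are near-uniform, so the loss is close to four times log of the vocabulary size; training would push it down.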

00:23:59.893 --> 00:24:01.655
And now, if you think about
what happens at test time,

00:24:01.655 --> 00:24:07.594
after we train this model, one thing that we might
want to do with it is to sample from the model,

00:24:07.594 --> 00:24:15.103
and actually use this trained neural network model to synthesize new text
that kind of looks similar in spirit to the text that it was trained on.

00:24:15.103 --> 00:24:22.716
The way that this will work is we'll typically seed the model with some input
prefix of text. In this case, the prefix is just the single letter h,

00:24:22.716 --> 00:24:27.295
and now we'll feed that letter h through the
first time step of our recurrent neural network.

00:24:27.295 --> 00:24:32.916
It will produce this distribution of scores
over all the characters in the vocabulary.

00:24:32.916 --> 00:24:37.501
Now, at test time, we'll use these
scores to actually sample from the model.

00:24:37.501 --> 00:24:41.421
So we'll use a softmax function to convert
those scores into a probability distribution

00:24:41.421 --> 00:24:47.362
and then we will sample from that probability distribution
to actually synthesize the second letter in the sequence.

00:24:47.362 --> 00:24:54.771
And in this case, even though the scores were pretty bad, maybe we got
lucky and sampled the letter e from this probability distribution.

00:24:54.771 --> 00:25:02.492
And now, we'll take this letter e that was sampled from this distribution
and feed it back as input into the network at the next time step.

00:25:02.492 --> 00:25:15.008
Now, we'll take this e, pull it down from the top, feed it back into the network as one of these one-hot
vector representations, and then repeat the process in order to synthesize the second letter in the output.

00:25:15.008 --> 00:25:20.722
And we can repeat this process over and over again to
synthesize a new sequence using this trained model,

00:25:20.722 --> 00:25:27.712
where we're synthesizing the sequence one character at a time using
these predicted probability distributions at each time step.
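This sample-and-feed-back loop can be sketched like so, again with an illustrative untrained toy model (the sizes and names are assumptions, not the lecture's exact code):

```python
import numpy as np

# Illustrative toy setup: four-character vocab, untrained random weights.
np.random.seed(0)
V, H = 4, 8
Wxh = np.random.randn(H, V) * 0.01
Whh = np.random.randn(H, H) * 0.01
Why = np.random.randn(V, H) * 0.01

def sample(seed_ix, n):
    """Seed with one character index, then sample n more characters,
    feeding each sampled character back in as the next input."""
    h = np.zeros(H)
    x = np.zeros(V); x[seed_ix] = 1.0
    out = [seed_ix]
    for _ in range(n):
        h = np.tanh(Wxh @ x + Whh @ h)
        y = Why @ h
        p = np.exp(y - y.max()); p /= p.sum()  # scores -> probabilities
        ix = np.random.choice(V, p=p)          # sample (not argmax)
        x = np.zeros(V); x[ix] = 1.0           # one-hot of the sample
        out.append(int(ix))
    return out

out = sample(0, 5)   # start from 'h' (index 0), sample five more characters
```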

00:25:27.712 --> 00:25:28.545
Question?

00:25:34.792 --> 00:25:41.315
Yeah, that's a great question. So the question is why might we
sample instead of just taking the character with the largest score?

00:25:41.315 --> 00:25:46.555
In this case, because of the probability distribution that
we had, taking the argmax would have given the wrong character,

00:25:47.451 --> 00:25:51.512
so we had to sample so the example could
work out and make sense.

00:25:51.512 --> 00:25:54.384
But in practice,
sometimes you'll see both.

00:25:54.384 --> 00:25:59.482
So sometimes you'll just take the argmax probability,
and that will sometimes be a little bit more stable,

00:25:59.482 --> 00:26:04.264
but one advantage of sampling, in general, is
that it lets you get diversity from your models.

00:26:04.264 --> 00:26:11.421
Sometimes you might have the same input, maybe the same prefix,
or in the case of image captioning, maybe the same image.

00:26:11.421 --> 00:26:20.032
But then if you sample rather than taking the argmax, then you'll see that sometimes these trained
models are actually able to produce multiple different types of reasonable output sequences,

00:26:20.032 --> 00:26:23.824
depending on which samples
they take at the first time steps.

00:26:23.824 --> 00:26:29.213
It's actually kind of a benefit because we
can now get more diversity in our outputs.

00:26:29.213 --> 00:26:30.630
Another question?

00:26:35.143 --> 00:26:40.435
Could we feed in the softmax vector instead of
the one element vector? You mean at test time?

00:26:46.162 --> 00:26:51.373
Yeah yeah, so the question is, at test time, could we feed
in this whole softmax vector rather than a one hot vector?

00:26:51.373 --> 00:26:56.782
There's kind of two problems with that. One is that that's
very different from the data that it saw at training time.

00:26:56.782 --> 00:27:05.413
In general, if you ask your model to do something at test time, which is different from training
time, then it'll usually blow up. It'll usually give you garbage and you'll usually be sad.

00:27:05.413 --> 00:27:09.112
The other problem is that in practice,
our vocabularies might be very large.

00:27:09.112 --> 00:27:13.202
So maybe, in this simple example, our vocabulary
is only four elements, so it's not a big problem.

00:27:13.202 --> 00:27:18.773
But if you're thinking about generating words one at a time,
now your vocabulary is every word in the English language,

00:27:18.773 --> 00:27:21.162
which could be something like
tens of thousands of elements.

00:27:21.162 --> 00:27:30.533
So in practice, this first operation that takes in the one-hot
vector is often performed using sparse vector operations rather than dense vectors.

00:27:30.533 --> 00:27:36.827
It would be, sort of, computationally really bad if you
wanted to feed in this whole 10,000-element softmax vector.

00:27:36.827 --> 00:27:40.302
So that's usually why we use a one
hot instead, even at test time.
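The sparse trick here is that multiplying by a one-hot vector just selects one column of the weight matrix, so there is no need to do the full dense product; a small sketch (sizes are illustrative):

```python
import numpy as np

np.random.seed(0)
V, H = 4, 8                  # toy vocab and hidden sizes
Wxh = np.random.randn(H, V)

ix = 2                       # index of the input character
x = np.zeros(V); x[ix] = 1.0

dense = Wxh @ x              # full matrix-vector product
sparse = Wxh[:, ix]          # equivalent: just select one column

assert np.allclose(dense, sparse)
```

For a real vocabulary of tens of thousands of words, the column-selection form is what embedding lookups effectively do.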

00:27:42.121 --> 00:27:47.104
This idea that we have a sequence and we produce
an output at every time step of the sequence

00:27:47.104 --> 00:27:51.543
and then finally compute some loss, this is
sometimes called backpropagation through time

00:27:51.543 --> 00:27:57.053
because you're imagining that in the forward pass, you're kind of
stepping forward through time and then during the backward pass,

00:27:57.053 --> 00:28:00.762
you're sort of going backwards through
time to compute all your gradients.

00:28:00.762 --> 00:28:06.162
This can actually be kind of problematic if you want
to train on sequences that are very, very long.

00:28:06.162 --> 00:28:15.453
So if you imagine that we were kind of trying to train a neural network language model on maybe
the entire text of Wikipedia, which is, by the way, something that people do pretty frequently,

00:28:15.453 --> 00:28:23.328
this would be super slow; every time we made a gradient step, we would
have to make a forward pass through the entire text of all of Wikipedia,

00:28:23.328 --> 00:28:27.813
and then make a backward pass through all of
Wikipedia, and then make a single gradient update.

00:28:27.813 --> 00:28:34.172
And that would be super slow. Your model would never converge. It would
also take a ridiculous amount of memory so this would be just really bad.

00:28:34.172 --> 00:28:39.933
In practice, what people do is this, sort of, approximation
called truncated backpropagation through time.

00:28:39.933 --> 00:28:45.562
Here, the idea is that, even though our input sequence
is very, very long, and even potentially infinite,

00:28:45.562 --> 00:28:56.232
what we'll do is that, when we're training the model, we'll step forward for some
number of steps; maybe a hundred is kind of a ballpark number that people frequently use,

00:28:56.232 --> 00:29:06.261
and we'll step forward for maybe a hundred steps, compute a loss only over this sub sequence
of the data, and then back propagate through this sub sequence, and now make a gradient step.

00:29:06.261 --> 00:29:12.064
And now, when we repeat, well, we still have these
hidden states that we computed from the first batch,

00:29:12.064 --> 00:29:20.631
and now, when we compute this next batch of data, we will carry those hidden
states forward in time, so the forward pass will be exactly the same.

00:29:20.631 --> 00:29:28.124
But now when we compute a gradient step for this next batch of
data, we will only backpropagate again through this second batch.

00:29:28.124 --> 00:29:32.760
Now, we'll make a gradient step based on
this truncated backpropagation through time.

00:29:32.760 --> 00:29:38.250
This process will continue, where now when we make the
next batch, we'll again copy these hidden states forward,

00:29:38.250 --> 00:29:43.840
but then step forward and then step backward,
but only for some small number of time steps.
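The chunking side of truncated backpropagation through time can be sketched as below; the chunk length and helper name are illustrative, and real autograd implementations would also detach the carried hidden state between chunks so gradients stop at the chunk boundary:

```python
def truncated_bptt_batches(data, chunk_len=100):
    """Yield consecutive (inputs, targets) chunks of a long sequence.
    The caller carries the hidden state forward across chunks but only
    backpropagates within each chunk before taking a gradient step."""
    for start in range(0, len(data) - 1, chunk_len):
        inputs = data[start:start + chunk_len]
        targets = data[start + 1:start + chunk_len + 1]
        yield inputs[:len(targets)], targets

h = None  # hidden state carried forward in time across chunks
for inputs, targets in truncated_bptt_batches(list(range(250)), chunk_len=100):
    # Forward from h over this chunk, compute the loss on the chunk only,
    # backprop only through the chunk, take a gradient step, then keep
    # the final hidden state as the starting h for the next chunk.
    pass
```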

00:29:43.840 --> 00:29:49.872
So you can kind of think of this as being analogous
to stochastic gradient descent in the case of sequences.

00:29:49.872 --> 00:29:53.690
Remember, when we talked about training
our models on large data sets,

00:29:53.690 --> 00:29:58.720
with these data sets, it would be super expensive to
compute the gradients over every element in the data set.

00:29:58.720 --> 00:30:02.520
So instead, we kind of take small samples,
small mini batches instead,

00:30:02.520 --> 00:30:08.053
and use mini batches of data to compute gradient
steps in the image classification case.

00:30:08.053 --> 00:30:08.886
Question?

00:30:12.441 --> 00:30:15.989
The question is, is this
kind of making the Markov assumption?

00:30:15.989 --> 00:30:20.581
No, not really. Because we're carrying
this hidden state forward in time forever.

00:30:21.442 --> 00:30:25.792
It's making a Markovian assumption in the
sense that, conditioned on the hidden state,

00:30:25.792 --> 00:30:31.101
the hidden state is all that we need to
predict the entire future of the sequence.

00:30:31.101 --> 00:30:35.941
But that assumption is kind of built into the
recurrent neural network formula from the start.

00:30:35.941 --> 00:30:39.032
And that's not really particular
to back propagation through time.

00:30:39.032 --> 00:30:51.239
Truncated backpropagation through time is just a way to approximate
these gradients without making a backward pass through your potentially very large sequence of data.

00:30:52.677 --> 00:30:59.649
This all sounds very complicated and confusing, and it sounds like a lot
of code to write, but in fact, this can actually be pretty concise.

00:30:59.649 --> 00:31:07.474
Andrej has this example he calls min-char-rnn that
does all of this stuff in just 112 lines of Python.

00:31:07.474 --> 00:31:11.725
It handles building the vocabulary. It trains the
model with truncated back propagation through time.

00:31:11.725 --> 00:31:16.584
And then, it can actually sample from
that model in actually not too much code.

00:31:16.584 --> 00:31:20.862
So even though this sounds like kind of a big,
scary process, it's actually not too difficult.

00:31:20.862 --> 00:31:27.954
I'd encourage you, if you're confused, to maybe go check this out and step through the
code on your own time, and see, kind of, all of these concrete steps happening in code.

00:31:27.954 --> 00:31:34.473
This is all in just a single file, all using numpy
with no dependencies, so it's relatively easy to read.

00:31:35.584 --> 00:31:41.593
So then, once we have this idea of training a recurrent neural
network language model, we can actually have a lot of fun with this.

00:31:41.593 --> 00:31:52.304
And we can take in, sort of, any text that we want. Take in, like, whatever random text you can think of from
the internet, train our recurrent neural network language model on this text, and then generate new text.

00:31:52.304 --> 00:32:02.634
So in this example, we took this entire text of all of Shakespeare's works, and then
used that to train a recurrent neural network language model on all of Shakespeare.

00:32:02.634 --> 00:32:11.584
And you can see that at the beginning of training, it's kind of producing random gibberish garbage,
but throughout the course of training, it ends up producing things that seem relatively reasonable.

00:32:11.584 --> 00:32:18.224
And after you've, after this model has been trained pretty well,
then it produces text that seems, kind of, Shakespeare-esque to me.

00:32:18.224 --> 00:32:24.754
"Why do what that day," replied, whatever, right, you can
read this. Like, it kind of looks kind of like Shakespeare.

00:32:24.754 --> 00:32:31.264
And if you actually train this model even more, and let it converge
even further, and then sample these even longer sequences,

00:32:31.264 --> 00:32:35.864
you can see that it learns all kinds of crazy cool
stuff that really looks like a Shakespeare play.

00:32:35.864 --> 00:32:40.016
It knows to use these
headings to say who's speaking.

00:32:40.016 --> 00:32:45.565
Then it produces these bits of text that have crazy
dialogue that sounds kind of Shakespeare-esque.

00:32:45.565 --> 00:32:47.744
It knows to put line breaks
in between these different things.

00:32:47.744 --> 00:32:52.958
And this is all, like, really cool, all just
sort of learned from the structure of the data.

00:32:52.958 --> 00:33:02.725
We can actually get even crazier than this. This was one of my favorite
examples I found online. Is anyone a mathematician in this room?

00:33:02.725 --> 00:33:07.325
Has anyone taken an algebraic topology course by
any chance? Wow, a couple, that's impressive.

00:33:07.325 --> 00:33:15.114
So you probably know more algebraic topology than me, but I
found this open source algebraic topology textbook online.

00:33:15.114 --> 00:33:19.554
It's just a whole bunch of TeX files that
are this super dense mathematics.

00:33:19.554 --> 00:33:26.325
And LaTeX, because LaTeX sort of lets you write
equations and diagrams and everything just using plain text.

00:33:26.325 --> 00:33:33.146
We can actually train our recurrent neural network language model
on the raw LaTeX source code of this algebraic topology textbook.

00:33:33.146 --> 00:33:41.446
And if we do that, then after we sample from the model, then we
get something that seems like, kind of like algebraic topology.

00:33:41.446 --> 00:33:46.679
So it knows to like put equations.
It puts all kinds of crazy stuff.

00:33:46.679 --> 00:33:52.773
It's like, to prove study, we see that F sub U is a
covering of x prime, blah, blah, blah, blah, blah.

00:33:52.773 --> 00:33:56.576
It knows where to put unions. It knows
to put squares at the end of proofs.

00:33:56.576 --> 00:34:00.026
It makes lemmas.
It makes references to previous lemmas.

00:34:00.026 --> 00:34:08.417
Right, like here, it says, namely, by a lemma, we see
that R is geometrically something. So it's actually pretty crazy.

00:34:09.496 --> 00:34:12.913
It also sometimes tries to make diagrams.

00:34:14.239 --> 00:34:19.313
For those of you that have taken algebraic topology, you know that these
commutative diagrams are kind of a thing that you work with a lot.

00:34:19.313 --> 00:34:26.379
So it kind of got the general gist of how to make those
diagrams, but they actually don't make any sense.

00:34:26.380 --> 00:34:32.728
And actually, one of my favorite examples
here is that it sometimes omits proofs.

00:34:32.728 --> 00:34:39.694
So it'll sometimes say, it'll sometimes say something like
theorem, blah, blah, blah, blah, blah, proof omitted.

00:34:39.695 --> 00:34:45.392
This thing kind of has gotten the gist of
how some of these math textbooks look like.

00:34:47.831 --> 00:34:53.497
We can have a lot of fun with this. So we also tried training one
of these models on the entire source code of the Linux kernel.

00:34:53.498 --> 00:34:56.801
'Cause again, this is just character-level
text that we can train on.

00:34:56.801 --> 00:35:01.483
And then, when we sample from this, it
actually, again, looks like C source code.

00:35:01.483 --> 00:35:06.192
It knows how to write if statements.
It has, like, pretty good code formatting skills.

00:35:06.192 --> 00:35:09.552
It knows to indent after these if
statements. It knows to put curly braces.

00:35:09.552 --> 00:35:15.052
It actually even makes comments, though
the things it says are usually nonsense.

00:35:15.052 --> 00:35:23.355
One problem with this model is that it knows how to declare
variables. But it doesn't always use the variables that it declares.

00:35:23.355 --> 00:35:27.554
And sometimes it tries to use variables that
haven't been declared. This wouldn't compile.

00:35:27.554 --> 00:35:30.555
I would not recommend sending this
as a pull request to Linux.

00:35:30.555 --> 00:35:37.867
This thing also figures out how to recite
this GNU license character by character.

00:35:37.867 --> 00:35:44.962
It kind of knows that you need to recite the GNU license and after the
license comes some includes, then some other includes, then source code.

00:35:44.962 --> 00:35:48.836
This thing has actually learned quite a lot
about the general structure of the data.

00:35:48.836 --> 00:35:53.398
Where, again, during training, all we asked this model to
do was try to predict the next character in the sequence.

00:35:53.398 --> 00:35:55.406
We didn't tell it any of this structure,

00:35:55.406 --> 00:36:03.355
but somehow, just through the course of this training process, it
learned a lot about the latent structure in the sequential data.

00:36:05.808 --> 00:36:10.246
Yeah, so it knows how to write code.
It does a lot of cool stuff.

00:36:10.246 --> 00:36:20.856
I had this paper with Andrej a couple years ago where we trained a bunch of these models, and then we wanted to
try to poke into the brains of these models and figure out, like, what are they doing and why are they working.

00:36:20.856 --> 00:36:28.165
So these recurrent neural networks have this hidden
vector, some vector that's updated at every time step.

00:36:28.165 --> 00:36:34.176
And then what we wanted to figure out is, could we find some
elements of this vector that have some semantic, interpretable meaning?

00:36:34.176 --> 00:36:39.917
So what we did is we trained a neural network language model,
one of these character level models on one of these data sets,

00:36:39.917 --> 00:36:47.923
and then we picked one of the elements in that hidden vector, and we look
at the value of that element over the course of a sequence

00:36:47.923 --> 00:36:52.206
to try to get some sense of maybe what these
different hidden states are looking for.

00:36:52.206 --> 00:36:56.406
When you do this, a lot of them end up looking
kind of like random gibberish garbage.

00:36:56.406 --> 00:37:02.686
So here again, what we've done, is we've picked one element of that
vector, and now we run the sequence forward through the trained model,

00:37:02.686 --> 00:37:10.876
and now the color of each character corresponds to the magnitude of that single scalar
element of the hidden vector at every time step when it's reading the sequence.

00:37:10.876 --> 00:37:15.883
So you can see that a lot of the vectors in these
hidden states are kind of not very interpretable.

00:37:15.883 --> 00:37:21.396
It seems like they're kind of doing some of this low level
language modeling to figure out what character should come next.

00:37:21.396 --> 00:37:23.438
But some of them end up quite nice.

00:37:23.438 --> 00:37:26.375
So here we found this vector
that is looking for quotes.

00:37:26.375 --> 00:37:32.486
You can see that there's this one hidden element, this one
element in the vector, that stays off, shown in blue,

00:37:32.486 --> 00:37:38.303
and then once it hits a quote, it turns on and
remains on for the duration of this quote.

00:37:39.187 --> 00:37:42.598
And now when we hit the second quotation
mark, then that cell turns off.

00:37:42.598 --> 00:37:48.856
So somehow, even though this model was only trained to predict the
next character in a sequence, it somehow learned that a useful thing,

00:37:48.856 --> 00:37:54.222
in order to do this, might be to have
some cell that's trying to detect quotes.

00:37:54.222 --> 00:38:00.104
We also found this other cell that looks like it's
counting the number of characters since a line break.

00:38:00.104 --> 00:38:07.055
So you can see that at the beginning of each line, this element starts
off at zero. Throughout the course of the line, it's gradually more red,

00:38:07.055 --> 00:38:11.976
so that value increases. And then after the
new line character, it resets to zero.

00:38:11.976 --> 00:38:19.545
So you can imagine that maybe this cell is letting the network keep
track of when it needs to produce these newline characters.

00:38:19.545 --> 00:38:22.987
We also found some that,
when we trained on the Linux source code,

00:38:22.987 --> 00:38:33.838
we found some examples that are turning on inside the conditions of if statements. So this maybe
allows the network to differentiate whether it's outside an if statement or inside that condition,

00:38:33.838 --> 00:38:35.806
which might help it model
these sequences better.

00:38:35.806 --> 00:38:44.765
We also found some that turn on in comments, or some that
seem like they're counting the number of indentation levels.

00:38:44.765 --> 00:38:50.758
This is all just really cool stuff because it's saying that even though
we are only trying to train this model to predict next characters,

00:38:50.758 --> 00:38:55.646
it somehow ends up learning a lot
of useful structure about the input data.

00:38:57.161 --> 00:39:05.528
So, this has not really been computer vision so
far, and we need to pull this back to computer vision since this is a vision class.

00:39:05.528 --> 00:39:08.747
We've alluded many times to this
image captioning model

00:39:08.747 --> 00:39:14.376
where we want to build models that can input
an image and then output a caption in natural language.

00:39:14.376 --> 00:39:19.787
There were a bunch of papers a couple years ago
that all had relatively similar approaches.

00:39:19.787 --> 00:39:25.026
But I'm showing the figure from the paper
from our lab in a totally unbiased way.

00:39:26.876 --> 00:39:35.608
But the idea here is that the caption is this variable-length sequence;
the sequence might have different numbers of words for different captions.

00:39:35.608 --> 00:39:39.947
So this is a totally natural fit for a
recurrent neural network language model.

00:39:39.947 --> 00:39:50.379
So then, what this model looks like is we have some convolutional network which will
take as input the image, and we've seen a lot about how convolutional networks work at this point,

00:39:50.379 --> 00:39:58.928
and that convolutional network will produce a summary vector of the image which will then
feed into the first time step of one of these recurrent neural network language models

00:39:58.928 --> 00:40:02.888
which will then produce words
of the caption one at a time.

00:40:02.888 --> 00:40:06.425
So the way that this kind of works at
test time after the model is trained

00:40:06.425 --> 00:40:11.178
looks almost exactly the same as these character
level language models that we saw a little bit ago.

00:40:11.178 --> 00:40:18.638
We'll take our input image, feed it through our convolutional network.
But now, instead of taking the softmax scores from an ImageNet model,

00:40:18.638 --> 00:40:24.915
we'll instead take this 4,096 dimensional
vector from the end of the model,

00:40:24.915 --> 00:40:31.377
and we'll take that vector and use it to
summarize the whole content of the image.

00:40:31.377 --> 00:40:39.528
Now, remember when we talked about RNN language models, we said that we need to seed
the language model with that first initial input to tell it to start generating text.

00:40:39.528 --> 00:40:49.006
So in this case, we'll give it some special start token, which is just saying, hey, this is the
start of a sentence. Please start generating some text conditioned on this image information.

00:40:49.006 --> 00:40:57.107
So now, previously we saw that in this RNN language model, we had these
matrices that were taking the input at the current time step

00:40:57.107 --> 00:41:01.240
and the hidden state of the previous time step and
combining those to get the next hidden state.

00:41:01.240 --> 00:41:04.678
Well now, we also need to add
in this image information.

00:41:04.678 --> 00:41:13.105
People play around with exactly different ways to incorporate this
image information, but one simple way is just to add a third weight matrix that is

00:41:14.561 --> 00:41:18.598
adding in this image information at every
time step to compute the next hidden state.
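This third-weight-matrix idea can be sketched as below, assuming a simple tanh recurrence; all sizes, names, and the initialization are illustrative, not the exact formulation from any captioning paper:

```python
import numpy as np

np.random.seed(0)
V, H, D = 10, 16, 4096       # vocab, hidden, CNN feature size (illustrative)
Wxh = np.random.randn(H, V) * 0.01
Whh = np.random.randn(H, H) * 0.01
Wih = np.random.randn(H, D) * 0.01   # the extra matrix for image features

def step(x, h, v):
    # Image information enters the recurrence at every time step.
    return np.tanh(Wxh @ x + Whh @ h + Wih @ v)

v = np.random.randn(D)       # stand-in for the 4,096-d CNN summary vector
h = np.zeros(H)
x = np.zeros(V); x[0] = 1.0  # one-hot <START> token (index 0 is arbitrary)
h = step(x, h, v)
```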

00:41:18.598 --> 00:41:23.635
So now, we'll compute this distribution
over all scores in our vocabulary

00:41:23.635 --> 00:41:27.307
and here, our vocabulary is something like all
English words, so it could be pretty large.

00:41:27.307 --> 00:41:32.099
We'll sample from that distribution and now pass
that word back as input at the next time step.

00:41:32.099 --> 00:41:39.546
And that will then feed that word in, again get a distribution over
all words in the vocab, and again sample to produce the next word.

00:41:39.546 --> 00:41:47.538
So then, after that thing is all done,
we'll generate this complete sentence.

00:41:47.538 --> 00:41:54.347
We stop generation once we sample the special end token, which
kind of corresponds to the period at the end of the sentence.

00:41:54.347 --> 00:42:00.328
Then once the network samples this end token, we stop generation,
and we're done, and we've gotten our caption for this image.

00:42:00.328 --> 00:42:06.739
And during training, we put an end token
at the end of every caption

00:42:06.739 --> 00:42:10.907
so that the network kind of learned during training
that end tokens come at the end of sequences.

00:42:10.907 --> 00:42:16.848
So then, during test time, it tends to sample
these end tokens once it's done generating.

00:42:16.848 --> 00:42:20.139
So we trained this model
in kind of a completely supervised way.

00:42:20.139 --> 00:42:24.763
You can find data sets that have images
together with natural language captions.

00:42:24.763 --> 00:42:28.666
Microsoft COCO is probably the biggest
and most widely used for this task.

00:42:28.666 --> 00:42:47.837
But you can just train this model in a purely supervised way, and then backpropagate through to jointly train both this recurrent neural network language model and
also pass gradients back into the final layer of the CNN, additionally updating the weights of the CNN to jointly tune all parts of the model to perform this task.

00:42:47.837 --> 00:42:52.438
Once you train these models, they actually
do some pretty reasonable things.

00:42:52.438 --> 00:42:56.689
These are some real results from a model,
from one of these trained models,

00:42:56.689 --> 00:43:00.995
and it says things like a cat sitting
on a suitcase on the floor,

00:43:00.995 --> 00:43:02.583
which is pretty impressive.

00:43:02.583 --> 00:43:04.975
It knows about cats
sitting on a tree branch,

00:43:04.975 --> 00:43:06.499
which is also pretty cool.

00:43:06.499 --> 00:43:09.631
It knows about two people walking
on the beach with surfboards.

00:43:09.631 --> 00:43:16.635
So these models are actually pretty powerful and can
produce relatively complex captions to describe the image.

00:43:16.635 --> 00:43:19.808
But that being said,
these models are really not perfect.

00:43:19.808 --> 00:43:24.319
They're not magical.
Just like any machine learning model,

00:43:24.319 --> 00:43:28.830
if you try to run them on data that was very different
from the training data, they don't work very well.

00:43:28.830 --> 00:43:33.317
So for example, here, it says
a woman is holding a cat in her hand.

00:43:33.317 --> 00:43:35.421
There's clearly no cat in the image.

00:43:35.421 --> 00:43:41.359
But she is wearing a fur coat, and maybe the texture
of that coat kind of looked like a cat to the model.

00:43:41.359 --> 00:43:44.379
Over here, we see a woman standing
on a beach holding a surfboard.

00:43:44.379 --> 00:43:47.892
Well, she's definitely not holding a
surfboard and she's doing a handstand,

00:43:47.892 --> 00:43:51.820
which is maybe the interesting part of that
image, and the model totally missed that.

00:43:51.820 --> 00:43:58.238
Also, over here, we see this example where there's this
picture of a spider web in the tree branch, and it totally,

00:43:58.238 --> 00:44:00.382
and it says something like
a bird sitting on a tree branch.

00:44:00.382 --> 00:44:05.139
So it totally missed the spider, but during
training, it never really saw examples of spiders.

00:44:05.139 --> 00:44:08.217
It just knows that birds sit
on tree branches during training.

00:44:08.217 --> 00:44:10.152
So it kind of makes these
reasonable mistakes.

00:44:10.152 --> 00:44:14.338
Or here at the bottom, it can't really tell the difference
between this guy throwing and catching the ball,

00:44:14.338 --> 00:44:17.709
but it does know that it's a baseball player
and there's balls and things involved.

00:44:17.709 --> 00:44:20.441
So again, just want to say that
these models are not perfect.

00:44:20.441 --> 00:44:25.109
They work pretty well when you ask them to caption
images that were similar to the training data,

00:44:25.109 --> 00:44:29.188
but they definitely have a hard time
generalizing far beyond that.

00:44:29.188 --> 00:44:34.152
So another thing you'll sometimes see is this
slightly more advanced model called Attention,

00:44:34.152 --> 00:44:41.036
where now when we're generating the words of this caption, we can allow
the model to steer its attention to different parts of the image.

00:44:41.036 --> 00:44:44.670
And I don't want to spend
too much time on this.

00:44:44.670 --> 00:44:48.925
But the general way that this works is
that now our convolutional network,

00:44:48.925 --> 00:44:59.954
rather than producing a single vector summarizing the entire image,
now produces a grid of vectors, giving maybe one vector for each spatial location in the image.

00:44:59.954 --> 00:45:06.618
And now, when we, when this model runs forward, in
addition to sampling the vocabulary at every time step,

00:45:06.618 --> 00:45:11.263
it also produces a distribution over the
locations in the image where it wants to look.

00:45:11.263 --> 00:45:18.276
And now this distribution over image locations can be seen as a
kind of attention over where the model should look during training.

00:45:18.276 --> 00:45:23.155
So now that first hidden state computes
this distribution over image locations,

00:45:23.155 --> 00:45:31.372
which then goes back to the set of vectors to give a single summary
vector that maybe focuses the attention on one part of that image.

00:45:31.372 --> 00:45:36.508
And now that summary vector gets fed, as an additional
input, at the next time step of the neural network.

00:45:36.508 --> 00:45:38.278
And now again, it will
produce two outputs.

00:45:38.278 --> 00:45:43.714
One is our distribution over vocabulary words. And
the other is a distribution over image locations.

00:45:43.714 --> 00:45:49.160
This whole process will continue, and it will sort
of do these two different things at every time step.

00:45:50.296 --> 00:45:58.298
And after you train the model, then you can see that it kind of will shift its
attention around the image for every word that it generates in the caption.

00:45:58.298 --> 00:46:04.436
Here you can see that it produced the caption,
a bird is flying over, I can't see that far.

00:46:04.436 --> 00:46:12.473
But you can see that its attention is shifting around different
parts of the image for each word in the caption that it generates.

00:46:14.112 --> 00:46:19.147
There's this notion of hard attention versus soft
attention, which I don't really want to get into too much,

00:46:19.147 --> 00:46:26.614
but with this idea of soft attention, we're kind of taking a
weighted combination of all features from all image locations,

00:46:26.614 --> 00:46:34.259
whereas in the hard attention case, we're forcing the model to select
exactly one location to look at in the image at each time step.
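As a rough numpy sketch of the soft attention case just described: a weighted combination of the feature vectors from all image locations, with the weights coming from a softmax over attention scores. (The function name is my own, and in the real model the scores would come from the RNN hidden state rather than being passed in directly.)

```python
import numpy as np

def soft_attention(features, scores):
    # features: (L, D) array, one feature vector per image location.
    # scores: (L,) attention scores, assumed to come from the RNN hidden state.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()        # softmax: a distribution over L locations
    return weights @ features       # weighted combination -> summary vector (D,)
```

Because this weighted sum is differentiable in the scores, vanilla backpropagation works; the hard attention case (picking exactly one location) is what breaks differentiability.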

00:46:34.259 --> 00:46:38.277
So the hard attention case where we're selecting
exactly one image location is a little bit tricky

00:46:38.277 --> 00:46:46.247
because that is not really a differentiable function, so you need to do something slightly
fancier than vanilla backpropagation in order to just train the model in that scenario.

00:46:46.247 --> 00:46:51.498
And I think we'll talk about that a little bit
later in the lecture on reinforcement learning.

00:46:51.498 --> 00:46:57.415
Now, after you train one of these attention models
and run it to generate captions,

00:46:57.415 --> 00:47:04.067
you can see that it tends to focus its attention on maybe the salient
or semantically meaningful part of the image when generating captions.

00:47:04.067 --> 00:47:08.413
You can see that the caption was a woman
is throwing a frisbee in a park

00:47:08.413 --> 00:47:14.516
and you can see that this attention mask, when it generated the
word, when the model generated the word frisbee, at the same time,

00:47:14.516 --> 00:47:18.976
it was focusing its attention on this image
region that actually contains the frisbee.

00:47:18.976 --> 00:47:24.542
This is actually really cool. We did not tell the
model where it should be looking at every time step.

00:47:24.542 --> 00:47:27.955
It sort of figured all that out for itself
during the training process.

00:47:27.955 --> 00:47:32.721
Because somehow, it figured out that looking at that
image region was the right thing to do for this image.

00:47:32.721 --> 00:47:34.610
And because everything in
this model is differentiable,

00:47:34.610 --> 00:47:42.541
because we can backpropagate through all these soft attention steps, all
of this soft attention stuff just comes out through the training process.

00:47:42.541 --> 00:47:45.297
So that's really, really cool.

00:47:45.297 --> 00:47:51.565
By the way, this idea of recurrent neural networks and attention
actually gets used in other tasks beyond image captioning.

00:47:51.565 --> 00:47:54.691
One recent example is this idea
of visual question answering.

00:47:54.691 --> 00:48:04.010
So here, our model is going to take two things as input. It's going to take an image and
it will also take a natural language question that's asking some question about the image.

00:48:04.010 --> 00:48:10.163
Here, we might see this image on the left and we might ask the
question, what endangered animal is featured on the truck?

00:48:10.163 --> 00:48:19.027
And now the model needs to select from one of these four natural language answers about
which of these answers correctly answers that question in the context of the image.

00:48:19.027 --> 00:48:24.774
So you can imagine kind of stitching this model
together using CNNs and RNNs in kind of a natural way.

00:48:24.774 --> 00:48:29.701
Now, we're in this
many to one scenario,

00:48:29.701 --> 00:48:33.566
where now our model needs to take as input
this natural language sequence,

00:48:33.566 --> 00:48:40.111
so we can imagine running a recurrent neural network over each element of
that input question, to now summarize the input question in a single vector.

00:48:40.111 --> 00:48:43.928
And then we can have a CNN
to again summarize the image,

00:48:43.928 --> 00:48:51.645
and now combine both the vector from the CNN and the vector from the
question-encoding RNN to then predict a distribution over answers.

00:48:51.645 --> 00:48:59.175
You'll also sometimes see this idea of soft spatial
attention being incorporated into things like visual question answering.

00:48:59.175 --> 00:49:07.062
So you can see that here, this model is also having the spatial attention
over the image when it's trying to determine answers to the questions.

00:49:07.062 --> 00:49:09.062
Yeah, question?

00:49:18.715 --> 00:49:20.878
So the question is
How are the different inputs combined?

00:49:20.878 --> 00:49:25.998
Do you mean like the encoded question
vector and the encoded image vector?

00:49:25.998 --> 00:49:30.395
Yeah, so the question is how are the encoded
image and the encoded question vector combined?

00:49:30.395 --> 00:49:34.847
Kind of the simplest thing to do is just to concatenate
them and stick them into fully connected layers.

00:49:34.847 --> 00:49:37.864
That's probably the most common and
that's probably the first thing to try.

00:49:37.864 --> 00:49:44.389
Sometimes people do slightly fancier things where they might try to have multiplicative
interactions between those two vectors to allow a more powerful function.

00:49:44.389 --> 00:49:48.050
But generally, concatenation is
kind of a good first thing to try.
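A minimal numpy sketch of that "first thing to try": concatenate the encoded image vector and the encoded question vector and push them through a fully connected layer. (The function name, the ReLU choice, and the shapes of the assumed parameters `W` and `b` are all illustrative, not from the lecture.)

```python
import numpy as np

def combine_features(img_vec, q_vec, W, b):
    # Concatenate the CNN image encoding and the RNN question encoding,
    # then apply one fully connected layer with a ReLU.
    fused = np.concatenate([img_vec, q_vec])
    return np.maximum(0.0, W @ fused + b)
```

Fancier fusions (e.g. multiplicative interactions between the two vectors) would replace the concatenation, but this is the common baseline.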

00:49:49.426 --> 00:49:55.084
Okay, so now we've talked about a bunch of scenarios where RNNs are
used for different kinds of problems. And I think it's super cool

00:49:55.084 --> 00:50:04.072
because it allows you to start tackling really complicated problems
combining images and computer vision with natural language processing.

00:50:04.072 --> 00:50:08.870
And you can see that we can kind of stitch together these
models like Lego blocks and attack really complicated things,

00:50:08.870 --> 00:50:16.736
like image captioning or visual question answering, just by stitching
together these relatively simple types of neural network modules.

00:50:18.708 --> 00:50:26.274
But I'd also like to mention that so far, we've talked about this idea of
a single recurrent network layer, where we have sort of one hidden state,

00:50:26.274 --> 00:50:32.581
and another thing that you'll see pretty commonly is
this idea of a multilayer recurrent neural network.

00:50:32.581 --> 00:50:44.554
Here, this is a three layer recurrent neural network, so now our input goes in
and produces a sequence of hidden states from the first recurrent neural network layer.

00:50:44.554 --> 00:50:50.468
And now, after we run kind of one recurrent neural network
layer, then we have this whole sequence of hidden states.

00:50:50.468 --> 00:50:56.474
And now, we can use the sequence of hidden states as an
input sequence to another recurrent neural network layer.

00:50:56.474 --> 00:51:01.577
This will then produce another sequence
of hidden states from the second RNN layer.

00:51:01.577 --> 00:51:04.181
And then you can just imagine stacking
these things on top of each other,

00:51:04.181 --> 00:51:09.398
cause we know that we've seen in other contexts that
deeper models tend to perform better for various problems.

00:51:09.398 --> 00:51:18.215
And the same kind of holds in RNNs as well. For many problems, you'll see maybe
a two or three layer recurrent neural network model is pretty commonly used.

00:51:18.215 --> 00:51:22.086
You typically don't see
super deep models in RNNs.

00:51:22.086 --> 00:51:29.327
So generally, like two, three, four layer
RNNs is maybe as deep as you'll typically go.
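The stacking just described can be sketched in a few lines of numpy: the hidden-state sequence produced by one layer becomes the input sequence to the next. (The function names, bias-free cells, and zero initial hidden state are my own simplifications.)

```python
import numpy as np

def rnn_step(x_t, h_prev, W):
    # One vanilla RNN step: stack input and hidden state, multiply, tanh.
    return np.tanh(W @ np.concatenate([x_t, h_prev]))

def multilayer_rnn(xs, Ws, H):
    # xs: list of input vectors; Ws: one weight matrix per layer; H: hidden size.
    seq = xs
    for W in Ws:
        h = np.zeros(H)
        out = []
        for x in seq:
            h = rnn_step(x, h, W)
            out.append(h)
        seq = out          # this layer's hidden states feed the next layer
    return seq
```

With input dimension D and hidden size H, the first layer's weight matrix has shape (H, D + H) and each later layer's has shape (H, 2H), since its inputs are the previous layer's hidden states.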

00:51:29.327 --> 00:51:38.222
Then, I think it's also really interesting and important to think about,
now we've seen kind of what kinds of problems these RNNs can be used for,

00:51:38.222 --> 00:51:43.229
but then you need to think a little bit more carefully about
exactly what happens to these models when we try to train them.

00:51:43.229 --> 00:51:47.447
So here, I've drawn this little vanilla
RNN cell that we've talked about so far.

00:51:47.447 --> 00:51:55.307
So here, we're taking our current input, x t, and our previous hidden
state, h t minus one, and then we stack, those are two vectors.

00:51:55.307 --> 00:52:05.783
So we can just stack them together, perform this matrix multiplication with our weight
matrix, and then squash that output through a tanh, and that will give us our next hidden state.

00:52:05.783 --> 00:52:10.131
And that's kind of the basic functional form
of this vanilla recurrent neural network.
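That functional form is short enough to write out directly. This is a sketch assuming a single stacked weight matrix and no bias term, following the lecture's notation; the function name is my own.

```python
import numpy as np

def rnn_step(x_t, h_prev, W):
    # Stack the current input and previous hidden state into one vector,
    # multiply by the weight matrix, and squash through a tanh.
    stacked = np.concatenate([x_t, h_prev])   # shape (D + H,)
    return np.tanh(W @ stacked)               # next hidden state, shape (H,)
```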

00:52:10.131 --> 00:52:17.738
But then, we need to think about what happens in this architecture
during the backward pass when we try to compute gradients?

00:52:17.738 --> 00:52:27.477
So then during the backward pass, we'll receive
the derivative of the loss with respect to h t.

00:52:28.568 --> 00:52:34.632
And during the backward pass through the cell, we'll need to
compute the derivative of the loss with respect to h t minus one.

00:52:34.632 --> 00:52:39.881
Then, when we compute this backward pass, we see that
the gradient flows backward through this red path.

00:52:39.881 --> 00:52:45.958
So first, that gradient will flow backwards through this tanh gate, and
then it will flow backwards through this matrix multiplication gate.

00:52:45.958 --> 00:52:54.711
And then, as we've seen in the homework and when implementing these matrix
multiplication layers, when you backpropagate through this matrix multiplication gate,

00:52:54.711 --> 00:52:58.457
you end up multiplying by the transpose
of that weight matrix.

00:52:58.457 --> 00:53:08.024
So that means that every time we backpropagate through one of these vanilla
RNN cells, we end up multiplying by some part of the weight matrix.

00:53:09.110 --> 00:53:16.599
So now if you imagine that we are sticking many of these recurrent neural network
cells in sequence, because again, this is an RNN, and we want to model sequences.

00:53:16.599 --> 00:53:20.921
Now if you imagine what happens to the gradient
flow through a sequence of these layers,

00:53:20.921 --> 00:53:23.770
then something kind of
fishy starts to happen.

00:53:23.770 --> 00:53:30.810
Because now, when we want to compute the gradient of the loss with respect
to h zero, we need to backpropagate through every one of these RNN cells.

00:53:30.810 --> 00:53:35.641
And every time you backpropagate through one cell,
you'll pick up one of these w transpose factors.

00:53:35.641 --> 00:53:44.967
So that means that the final expression for the gradient on h zero will
involve many, many factors of this weight matrix, which could be kind of bad.

00:53:44.967 --> 00:53:50.138
Maybe don't think about the matrix
case for now, but imagine a scalar case.

00:53:50.138 --> 00:53:54.563
If we have some scalar and we multiply
by that same number over and over and over again,

00:53:54.563 --> 00:54:00.269
maybe not for four examples, but for something
like a hundred or several hundred time steps,

00:54:00.269 --> 00:54:03.835
then multiplying by the same number
over and over again is really bad.

00:54:03.835 --> 00:54:14.069
In the scalar case, it's either going to explode in the case that that number is greater than one
or it's going to vanish towards zero in the case that number is less than one in absolute value.
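You can see this numerically in two lines: repeatedly multiplying a gradient by the same scalar, as in the backward pass through many vanilla RNN cells. (The factors 1.1 and 0.9 and the count of 100 time steps are illustrative numbers, not from the lecture.)

```python
grad = 1.0
for _ in range(100):
    grad *= 1.1      # factor greater than one in absolute value
print(grad)          # roughly 1.4e4 -- the gradient explodes

grad = 1.0
for _ in range(100):
    grad *= 0.9      # factor less than one in absolute value
print(grad)          # roughly 2.7e-5 -- the gradient vanishes
```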

00:54:14.069 --> 00:54:18.186
And the only way in which this will not
happen is if that number is exactly one,

00:54:18.186 --> 00:54:20.911
which is actually very
rare to happen in practice.

00:54:20.911 --> 00:54:25.229
That same intuition
extends to the matrix case,

00:54:25.229 --> 00:54:32.071
but now, rather than the absolute value of a scalar, you instead
need to look at the largest singular value of this weight matrix.

00:54:32.071 --> 00:54:46.079
Now if that largest singular value is greater than one, then during this backward pass,
when we multiply by the weight matrix over and over, that gradient on h zero will become very, very large.

00:54:46.079 --> 00:54:48.947
And that's something we call
the exploding gradient problem.

00:54:48.947 --> 00:54:55.061
Where now this gradient will explode exponentially in depth
with the number of time steps that we backpropagate through.

00:54:55.061 --> 00:55:05.349
And if the largest singular value is less than one, then we get the opposite problem, where now our gradients will
shrink and shrink and shrink exponentially, as we backpropagate and pick up more and more factors of this weight matrix.

00:55:05.349 --> 00:55:08.863
That's called the
vanishing gradient problem.

00:55:08.863 --> 00:55:14.208
There's a bit of a hack that people sometimes do to fix
the exploding gradient problem called gradient clipping,

00:55:14.208 --> 00:55:19.713
which is just this simple heuristic saying that
after we compute our gradient, if that gradient,

00:55:19.713 --> 00:55:27.955
if its L2 norm is above some threshold, then just scale
it down so it has at most that maximum norm.

00:55:27.955 --> 00:55:32.645
This is kind of a nasty hack, but it actually gets used in
practice quite a lot when training recurrent neural networks.

00:55:32.645 --> 00:55:39.034
And it's a relatively useful tool for
attacking this exploding gradient problem.
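The heuristic is small enough to sketch directly. This is a minimal version assuming a single gradient array and an illustrative threshold; deep learning frameworks ship their own variants of this.

```python
import numpy as np

def clip_gradient(grad, max_norm=5.0):
    # If the gradient's L2 norm exceeds the threshold, rescale it
    # so its norm equals the threshold; otherwise leave it alone.
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad
```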

00:55:39.034 --> 00:55:46.207
But now for the vanishing gradient problem, what we typically do
is we might need to move to a more complicated RNN architecture.

00:55:46.207 --> 00:55:53.524
So that motivates this idea of an LSTM. An
LSTM, which stands for Long Short Term Memory,

00:55:53.524 --> 00:55:58.316
is this slightly fancier recurrence relation
for these recurrent neural networks.

00:55:58.316 --> 00:56:03.330
It's really designed to help alleviate this
problem of vanishing and exploding gradients.

00:56:03.330 --> 00:56:10.193
So that rather than kind of hacking on top of it, we just kind of
design the architecture to have better gradient flow properties.

00:56:10.193 --> 00:56:16.556
Kind of an analogy to those fancier CNN architectures
that we saw at the top of the lecture.

00:56:16.556 --> 00:56:20.807
Another thing to point out is that the
LSTM cell actually comes from 1997.

00:56:20.807 --> 00:56:24.073
So this idea of an LSTM
has been around for quite a while,

00:56:24.073 --> 00:56:28.852
and these folks were working on these ideas way back
in the 90s, were definitely ahead of the curve.

00:56:28.852 --> 00:56:32.475
Because these models are kind of
used everywhere now 20 years later.

00:56:33.864 --> 00:56:37.921
And LSTMs kind of have
this funny functional form.

00:56:37.921 --> 00:56:46.397
So remember when we had this vanilla recurrent neural network, it had this hidden state.
And we used this recurrence relation to update the hidden state at every time step.

00:56:46.397 --> 00:56:51.462
Well, now in an LSTM, we actually have two, we
maintain two hidden states at every time step.

00:56:51.462 --> 00:56:59.305
One is this h t, which is called the hidden state, which is kind
of an analogy to the hidden state that we had in the vanilla RNN.

00:56:59.305 --> 00:57:03.604
But an LSTM also maintains the second
vector, c t, called the cell state.

00:57:03.604 --> 00:57:12.240
And the cell state is this vector which is kind of internal, kept inside
the LSTM, and it does not really get exposed to the outside world.

00:57:12.240 --> 00:57:15.260
And we'll see, and you can kind of see
that through this update equation,

00:57:15.260 --> 00:57:20.803
where you can see that, first, when we compute
these gates, we take our two inputs,

00:57:20.803 --> 00:57:25.485
we use them to compute these four gates
called i, f, o, and g.

00:57:25.485 --> 00:57:33.802
We use those gates to update our cell states, c t, and then we expose
part of our cell state as the hidden state at the next time step.

00:57:36.704 --> 00:57:44.014
This is kind of a funny functional form, and I want to walk through for a couple
slides exactly why do we use this architecture and why does it make sense,

00:57:44.014 --> 00:57:47.731
especially in the context
of vanishing or exploding gradients.

00:57:47.731 --> 00:57:57.213
The first thing that we do in an LSTM is that we're given this previous
hidden state, h t minus one, and we're given our current input vector, x t,

00:57:57.213 --> 00:57:58.611
just like in the vanilla RNN.

00:57:58.611 --> 00:58:08.663
In the vanilla RNN, remember, we took those two input vectors. We concatenated them.
Then we did a matrix multiply to directly compute the next hidden state in the RNN.

00:58:08.663 --> 00:58:15.315
Now, the LSTM does something a little bit different. We're going to
take our previous hidden state and our current input, stack them,

00:58:15.315 --> 00:58:21.926
and now multiply by a very big weight
matrix, w, to compute four different gates,

00:58:21.926 --> 00:58:24.391
Which all have the same
size as the hidden state.

00:58:24.391 --> 00:58:30.808
Sometimes, you'll see this written in different ways. Some
authors will write a different weight matrix for each gate.

00:58:30.808 --> 00:58:33.160
Some authors will combine them all
into one big weight matrix.

00:58:33.160 --> 00:58:34.677
But it's all really the same thing.

00:58:34.677 --> 00:58:40.288
The idea is that we take our hidden state, our current
input, and then we use those to compute these four gates.

00:58:40.288 --> 00:58:47.768
You often see these four gates written as i, f, o,
g, ifog, which makes it pretty easy to remember what they are.

00:58:47.768 --> 00:58:53.655
I is the input gate. It says how much
do we want to input into our cell.

00:58:53.655 --> 00:59:00.194
F is the forget gate. How much do we want to forget
the cell memory from the previous time step.

00:59:00.194 --> 00:59:04.653
O is the output gate, which is how much do we
want to reveal our cell to the outside world.

00:59:04.653 --> 00:59:10.130
And G really doesn't have a nice name,
so I usually call it the gate gate.

00:59:10.130 --> 00:59:14.626
G tells us how much we want
to write into our cell.

00:59:14.626 --> 00:59:19.665
And then you notice that each of these four
gates are using a different non linearity.

00:59:21.724 --> 00:59:25.047
The input, forget and output gate
are all using sigmoids,

00:59:25.047 --> 00:59:28.571
which means that their values
will be between zero and one.

00:59:28.571 --> 00:59:34.316
Whereas the gate gate uses a tanh, which means
its output will be between minus one and one.
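Putting the gates and nonlinearities together, one LSTM step looks like this in numpy. This is a sketch following the lecture's notation, with a single stacked weight matrix of shape (4h, D + h), no biases, and the gate order i, f, o, g; real implementations differ in these details.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W):
    H = h_prev.shape[0]
    # One big matrix multiply produces all four gates at once.
    gates = W @ np.concatenate([x_t, h_prev])   # shape (4H,)
    i = sigmoid(gates[0:H])          # input gate: how much to write
    f = sigmoid(gates[H:2*H])        # forget gate: how much to keep
    o = sigmoid(gates[2*H:3*H])      # output gate: how much to reveal
    g = np.tanh(gates[3*H:4*H])      # "gate gate": candidate values
    c_t = f * c_prev + i * g         # update the internal cell state
    h_t = o * np.tanh(c_t)           # expose part of the cell as the hidden state
    return h_t, c_t
```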

00:59:34.316 --> 00:59:41.725
So, these are kind of weird, but it makes a little bit
more sense if you imagine them all as binary values.

00:59:41.725 --> 00:59:44.744
Right, like what happens at the extremes
of these two values?

00:59:46.033 --> 00:59:53.926
If you look at this next equation after we compute these gates, you can see
that our cell state is being multiplied element wise by the forget gate.

00:59:53.926 --> 01:00:00.350
Sorry, our cell state from the previous time step is being multiplied
element wise by this forget gate. And now if this forget gate,

01:00:00.350 --> 01:00:03.861
you can think of it as being
a vector of zeros and ones,

01:00:03.861 --> 01:00:10.644
that's telling us for each element in the cell state, do we want to
forget that element of the cell in the case that the forget gate is zero?

01:00:10.644 --> 01:00:14.935
Or do we want to remember that element of the
cell in the case that the forget gate is one.

01:00:14.935 --> 01:00:19.767
Now, once we've used the forget gate
to gate off the part of the cell state,

01:00:19.767 --> 01:00:24.814
then we have the second term, which is
the element wise product of i and g.

01:00:24.814 --> 01:00:28.824
So now, i is this vector of zeros and ones,
cause it's coming through a sigmoid,

01:00:28.824 --> 01:00:35.728
telling us for each element of the cell state, do we want to write
to that element of the cell state in the case that i is one,

01:00:35.728 --> 01:00:41.719
or do we not want to write to that element of the cell
state at this time step in the case that i is zero.

01:00:41.719 --> 01:00:46.031
And now the gate gate, because it's coming
through a tanh, will be either one or minus one.

01:00:46.031 --> 01:00:54.022
So that is the value that we want, the candidate value that we might
consider writing to each element of the cell state at this time step.

01:00:54.022 --> 01:01:02.499
Then if you look at the cell state equation, you can see that at every time
step, the cell state has these different, independent scalar values,

01:01:02.499 --> 01:01:05.230
and they're all being incremented
or decremented by one.

01:01:05.230 --> 01:01:11.028
So there's kind of like, inside the cell state, we
can either remember or forget our previous state,

01:01:11.028 --> 01:01:16.535
and then we can either increment or decrement each element
of that cell state by up to one at each time step.

01:01:16.535 --> 01:01:25.708
So you can kind of think of these elements of the cell state as being little
scalar integer counters that can be incremented and decremented at each time step.

01:01:25.708 --> 01:01:36.513
And now, after we've computed our cell state, then we use our now updated cell
state to compute a hidden state, which we will reveal to the outside world.

01:01:36.513 --> 01:01:43.353
So because this cell state has this interpretation of being counters,
and sort of counting up by one or minus one at each time step,

01:01:43.353 --> 01:01:51.826
we want to squash that counter value into a nice minus one to one range
using a tanh. And now, we multiply element wise, by this output gate.

01:01:51.826 --> 01:01:57.597
And the output gate is again coming through a sigmoid,
so you can think of it as being mostly zeros and ones,

01:01:57.597 --> 01:02:08.577
and the output gate tells us for each element of our cell state, do we want to reveal or not reveal
that element of our cell state when we're computing the external hidden state for this time step.

01:02:09.736 --> 01:02:17.132
And then, I think there's kind of a tradition in people trying to explain LSTMs,
that everyone needs to come up with their own potentially confusing LSTM diagram.

01:02:17.132 --> 01:02:18.882
So here's my attempt.

01:02:20.380 --> 01:02:23.866
Here, we can see what's going
on inside this LSTM cell,

01:02:23.866 --> 01:02:31.266
is that we're taking as input on the left our previous cell
state and the previous hidden state, as well as our current input, x t.

01:02:31.266 --> 01:02:37.346
Now we're going to take our previous hidden state,
as well as our current input, stack them,

01:02:37.346 --> 01:02:41.166
and then multiply with this weight
matrix, w, to produce our four gates.

01:02:41.166 --> 01:02:44.836
And here, I've left out the non linearities
because we saw those on a previous slide.

01:02:44.836 --> 01:02:48.143
And now the forget gate multiplies
element wise with the cell state.

01:02:48.143 --> 01:02:54.524
The input and gate gate are multiplied element wise and
added to the cell state. And that gives us our next cell.

01:02:54.524 --> 01:03:01.366
The next cell gets squashed through a tanh, and multiplied element
wise with this output gate to produce our next hidden state.

01:03:02.417 --> 01:03:03.250
Question?

01:03:13.116 --> 01:03:17.878
No, they're coming from different
parts of this weight matrix.

01:03:17.878 --> 01:03:26.415
So if our x and our h each have dimension h,
then after we stack them, they'll be a vector of size two h,

01:03:26.415 --> 01:03:30.393
and now our weight matrix will be this
matrix of size four h times two h.

01:03:30.393 --> 01:03:41.511
So you can think of that as sort of having four chunks of this weight matrix. And each of
these four chunks of the weight matrix is going to compute a different one of these gates.

01:03:42.404 --> 01:03:51.109
You'll often see this written for clarity, kind of combining all four of those different
weight matrices into a single large matrix, w, just for notational convenience.

01:03:51.109 --> 01:03:56.067
But they're all computed using
different parts of the weight matrix.

01:03:57.080 --> 01:04:04.574
But you're correct in that they're all computed using the same functional
form of just stacking the two things and taking the matrix multiplication.

01:04:04.574 --> 01:04:11.196
Now that we have this picture, we can think about what
happens to an LSTM cell during the backward pass.

01:04:11.196 --> 01:04:18.999
We saw, in the context of vanilla recurrent neural network, that some bad things happened
during the backwards pass, where we were continually multiplying by that weight matrix, w.

01:04:18.999 --> 01:04:23.238
But now, the situation looks
quite a bit different in the LSTM.

01:04:23.238 --> 01:04:29.952
If you imagine this path backwards of computing the
gradients of the cell state, we get quite a nice picture.

01:04:29.952 --> 01:04:33.320
Now, when we have our upstream gradient
from the cell coming in,

01:04:33.320 --> 01:04:37.737
then once we backpropagate backwards
through this addition operation,

01:04:37.737 --> 01:04:43.663
remember that this addition just copies that
upstream gradient into the two branches,

01:04:43.663 --> 01:04:50.435
so our upstream gradient gets copied directly and passed
directly to backpropagating through this element wise multiply.

01:04:50.435 --> 01:04:56.455
So then our upstream gradient ends up getting
multiplied element wise by the forget gate.

01:04:56.455 --> 01:05:07.171
As we backpropagate backwards through this cell state, the only thing that happens to our upstream
cell state gradient is that it ends up getting multiplied element wise by the forget gate.

01:05:07.171 --> 01:05:12.640
This is really a lot nicer
than the vanilla RNN for two reasons.

01:05:12.640 --> 01:05:18.498
One is that this forget gate is now an element wise
multiplication rather than a full matrix multiplication.

01:05:18.498 --> 01:05:24.964
So element wise multiplication is going to be a
little bit nicer than full matrix multiplication.

01:05:24.964 --> 01:05:31.354
Second is that element wise multiplication will potentially
be multiplying by a different forget gate at every time step.

01:05:31.354 --> 01:05:36.660
So remember, in the vanilla RNN, we were continually
multiplying by that same weight matrix over and over again,

01:05:36.660 --> 01:05:40.563
which led very explicitly
to these exploding or vanishing gradients.

01:05:40.563 --> 01:05:45.161
But now in the LSTM case, this forget
gate can vary from each time step.

01:05:45.161 --> 01:05:51.670
Now, it's much easier for the model to avoid these
problems of exploding and vanishing gradients.

01:05:51.670 --> 01:05:58.438
Finally, because this forget gate is coming out from a sigmoid, this
element wise multiply is guaranteed to be between zero and one,

01:05:58.438 --> 01:06:04.278
which again, leads to sort of nicer numerical properties if
you imagine multiplying by these things over and over again.

01:06:04.278 --> 01:06:14.273
Another thing to notice is that in the context of the vanilla recurrent neural network, we saw
that during the backward pass, our gradients were flowing through also a tanh at every time step.

01:06:14.273 --> 01:06:24.110
But now, in an LSTM, our hidden state
is used to compute those outputs, y t,

01:06:24.110 --> 01:06:33.024
so now, each hidden state, if you imagine backpropagating from the final
hidden state back to the first cell state, then through that backward path,

01:06:33.024 --> 01:06:42.044
we only backpropagate through a single tanh non linearity
rather than through a separate tanh at every time step.

01:06:42.044 --> 01:06:50.483
So kind of when you put all these things together, you can see this backwards
pass backpropagating through the cell state is kind of a gradient super highway

01:06:50.483 --> 01:06:59.172
that lets gradients pass relatively unimpeded from the loss at the very end of the
model all the way back to the initial cell state at the beginning of the model.

01:06:59.172 --> 01:07:00.922
Was there a question?

01:07:02.901 --> 01:07:06.792
Yeah, what about the gradient in respect to w? 'Cause
that's ultimately the thing that we care about.

01:07:06.792 --> 01:07:19.848
So, the gradient with respect to w will come through at every time step: we'll take our current cell
state as well as our current hidden state, and that will give us our local gradient on w for that time step.

01:07:19.848 --> 01:07:29.587
And just as in the vanilla RNN case, we'll end up adding
those per-time-step w gradients to compute our final gradient on w.

01:07:29.587 --> 01:07:37.280
But now, if you imagine the situation where we have a very long sequence,
and we're only injecting gradients at the very end of the sequence.

01:07:37.280 --> 01:07:43.219
Now, as you backpropagate through, we'll get
a local gradient on w for each time step,

01:07:43.219 --> 01:07:48.506
and that local gradient on w will be
coming through these gradients on c and h.

01:07:48.506 --> 01:07:52.751
So because we're maintaining the gradients
on c much more nicely in the LSTM case,

01:07:52.751 --> 01:07:59.804
those local gradients on w at each time step will also be
carried forward and backward through time much more cleanly.
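
The point about the shared weights can be made concrete with a tiny sketch (illustrative, not from the lecture): because w is reused at every time step, the backward pass produces one local gradient per step, and the total gradient on w is their sum. The per-step gradient values here are made up.

```python
import numpy as np

# Hypothetical per-time-step local gradients on a shared weight matrix W.
local_grads = [np.full((2, 2), 0.1 * t) for t in range(1, 5)]

# Because W is reused at every step, the total gradient is the sum of the
# local gradients accumulated during the backward pass.
dW = np.zeros((2, 2))
for g in reversed(local_grads):    # backprop visits the last step first
    dW += g

print(dW[0, 0])    # 0.1 + 0.2 + 0.3 + 0.4, i.e. 1.0 up to floating point
```

In the LSTM, each of these local gradients arrives through the well-preserved gradients on c and h, which is why they stay clean over long sequences.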

01:08:01.627 --> 01:08:03.044
Another question?

01:08:17.428 --> 01:08:22.088
Yeah, so the question is due to the non linearities,
could this still be susceptible to vanishing gradients?

01:08:22.089 --> 01:08:34.103
And that could be the case. Actually, one problem you might imagine is that if these forget gates are
always less than one, you might get vanishing gradients as you repeatedly multiply through these forget gates.

01:08:34.103 --> 01:08:42.746
Well, one sort of trick that people do in practice is that they will,
sometimes, initialize the biases of the forget gate to be somewhat positive.

01:08:42.746 --> 01:08:46.305
So that at the beginning of training, those
forget gates are always very close to one.

01:08:46.305 --> 01:08:56.631
So that, at least at the beginning of training, we have relatively clean gradient
flow through these forget gates, since they're all initialized to be near one.

01:08:56.631 --> 01:09:02.741
And then throughout the course of training, then the model can
learn those biases and kind of learn to forget where it needs to.
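That bias-initialization trick looks roughly like this (a minimal sketch with hypothetical parameter shapes; real implementations fold this into the framework's LSTM initializer):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

np.random.seed(0)
H = 8
# Hypothetical forget-gate parameters at initialization: small random
# weights, but a positive bias so the gate starts out mostly open.
W_f = 0.01 * np.random.randn(H, 2 * H)
b_f = np.full(H, 1.0)                  # the positive-bias trick

xh = np.random.randn(2 * H)            # concatenated [x_t, h_{t-1}]
f = sigmoid(W_f @ xh + b_f)

# With b_f = 1 and tiny weights, f starts near sigmoid(1) ~ 0.73, so
# early in training the cell state is mostly remembered, not forgotten.
print(f.mean())
```

As training proceeds, gradient descent adjusts b_f (and W_f), so the model can learn to forget where it needs to, as described above.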

01:09:02.742 --> 01:09:09.182
You're right that there still could be some potential for vanishing
gradients here. But it's much less extreme than the vanilla RNN case,

01:09:09.182 --> 01:09:12.126
both because those forget gate values
can vary at each time step,

01:09:12.126 --> 01:09:19.126
and also because we're doing this element wise
multiplication rather than a full matrix multiplication.

01:09:19.126 --> 01:09:23.048
So you can see that this LSTM
actually looks quite similar to ResNet.

01:09:23.048 --> 01:09:28.069
In this residual network, we had this path of identity
connections going backward through the network

01:09:28.069 --> 01:09:32.303
and that gave, sort of a gradient super highway
for gradients to flow backward in ResNet.

01:09:32.303 --> 01:09:34.924
And now it's kind of the
same intuition in LSTM

01:09:34.924 --> 01:09:45.244
where these additive and element wise multiplicative interactions of the cell state can give a
similar gradient super highway for gradients to flow backwards through the cell state in an LSTM.

01:09:46.343 --> 01:09:55.558
And by the way, there's this other kind of nice paper called highway networks, which
is kind of in between this idea of this LSTM cell and these residual networks.

01:09:57.796 --> 01:10:01.165
So these highway networks
actually came before residual networks,

01:10:01.165 --> 01:10:08.804
and they had this idea where at every layer of the highway network, we're
going to compute sort of a candidate activation, as well as a gating function

01:10:08.804 --> 01:10:17.017
that interpolates between our previous input to that layer, and
that candidate activation that came through our convolutions or what not.

01:10:17.017 --> 01:10:20.472
So there's actually a lot of architectural
similarities between these things,

01:10:20.472 --> 01:10:27.688
and people take a lot of inspiration from training very deep
CNNs and very deep RNNs and there's a lot of crossover here.
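
A highway layer can be sketched in a few lines (hypothetical fully-connected parameters, biases chosen for illustration): the output is a gated interpolation y = T * H(x) + (1 - T) * x, so when the transform gate T is near zero, the layer passes its input through unchanged, much like a ResNet identity connection.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

D = 4
rng = np.random.default_rng(0)
W_h, b_h = rng.standard_normal((D, D)), np.zeros(D)
W_t = rng.standard_normal((D, D))
b_t = np.full(D, -20.0)            # strongly negative gate bias: T near 0

x = rng.standard_normal(D)
H_x = np.tanh(W_h @ x + b_h)       # candidate activation
T_gate = sigmoid(W_t @ x + b_t)    # transform gate
y = T_gate * H_x + (1.0 - T_gate) * x

print(np.allclose(y, x, atol=1e-3))   # True: gate near 0 gives the identity
```

With the gate biased the other way (T near 1), the same layer behaves like an ordinary transformation, which is the sense in which highway networks sit between plain deep nets and residual nets.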

01:10:27.688 --> 01:10:34.675
Very briefly, you'll see a lot of other variants of recurrent
neural network architectures out there in the wild.

01:10:34.675 --> 01:10:40.853
Probably the most common, apart from the LSTM,
is the GRU, the gated recurrent unit.

01:10:40.853 --> 01:10:45.701
And you can see those update equations here, and
it kind of has a similar flavor to the LSTM,

01:10:45.701 --> 01:10:53.828
where it uses these multiplicative element wise gates together with
these additive interactions to avoid this vanishing gradient problem.
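
Those GRU update equations can be sketched as follows (biases omitted for brevity, and the parameter shapes are hypothetical). Note how the final line is an element wise interpolation between the old hidden state and the candidate, which is the additive interaction that helps gradient flow:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(x, h, Wz, Wr, Wc):
    """One GRU step, sketching the update equations."""
    xh = np.concatenate([x, h])
    z = sigmoid(Wz @ xh)                                # update gate
    r = sigmoid(Wr @ xh)                                # reset gate
    h_cand = np.tanh(Wc @ np.concatenate([x, r * h]))   # candidate state
    return (1.0 - z) * h + z * h_cand                   # gated interpolation

rng = np.random.default_rng(0)
D, H = 3, 4
Wz, Wr, Wc = (rng.standard_normal((H, D + H)) for _ in range(3))
h = np.zeros(H)
for _ in range(5):                                      # unroll a short sequence
    h = gru_step(rng.standard_normal(D), h, Wz, Wr, Wc)
print(h.shape)   # (4,)
```

There's no separate cell state here; the gated, additive update on h itself plays the role that the cell-state highway plays in the LSTM.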

01:10:53.828 --> 01:10:59.734
There's also this cool paper called LSTM: A
Search Space Odyssey, very inventive title,

01:10:59.734 --> 01:11:04.980
where they tried to play around with the LSTM equations
and swap out the non linearities at one point,

01:11:04.980 --> 01:11:07.343
like do we really need that tanh when exposing
the cell state through the output gate,

01:11:07.343 --> 01:11:14.017
and they tried to answer a lot of these different questions about each of
those non linearities, each of those pieces of the LSTM update equations.

01:11:14.017 --> 01:11:18.068
What happens if we change the model and
tweak those LSTM equations a little bit.

01:11:18.068 --> 01:11:24.521
And kind of the conclusion is that they all work about the same. Some of
them work a little bit better than others for one problem or another.

01:11:24.521 --> 01:11:32.148
But generally, none of the tweaks of the LSTM that they tried
were significantly better than the original LSTM for all problems.

01:11:32.148 --> 01:11:41.084
So that gives you a little bit more faith that, even though the LSTM update equations
seem kind of magical, they're useful anyway. You should probably consider them for your problem.

01:11:41.084 --> 01:11:43.692
There's also this cool paper
from Google a couple years ago

01:11:43.692 --> 01:11:52.605
where they did kind of an evolutionary search over
a very large number of random RNN architectures,

01:11:52.605 --> 01:12:00.860
they kind of randomly permute these update equations and try putting the additions and the
multiplications and the gates and the non linearities in different kinds of combinations.

01:12:00.860 --> 01:12:08.570
They blasted this out over their huge Google cluster and just tried
a whole bunch of these different weight updates in various flavors.

01:12:08.570 --> 01:12:15.446
And again, it was the same story that they didn't really find anything
that was significantly better than these existing GRU or LSTM styles.

01:12:15.446 --> 01:12:19.518
Although there were some variations that worked maybe
slightly better or worse for certain problems.

01:12:19.518 --> 01:12:27.080
But kind of the takeaway is that, when using an LSTM
or GRU, there's not so much magic in those equations,

01:12:27.080 --> 01:12:33.356
but this idea of managing gradient flow properly through these
additive connections and these multiplicative gates is super useful.

01:12:34.888 --> 01:12:40.103
So yeah, the summary is that RNNs are super cool. They
can allow you to attack tons of new types of problems.

01:12:40.103 --> 01:12:43.431
They sometimes are susceptible to
vanishing or exploding gradients.

01:12:43.431 --> 01:12:47.412
But we can address that with gradient
clipping and with fancier architectures.

01:12:47.412 --> 01:12:52.043
And there's a lot of cool overlap between
CNN architectures and RNN architectures.

01:12:52.043 --> 01:13:02.801
- So next time, you'll be taking the midterm. - But after that, we'll have a, sorry,
a question? The midterm is after this lecture, so anything up to this point is fair game.